Methods, systems, and software for identifying functional biomolecules

ABSTRACT

The present invention generally relates to methods of rapidly and efficiently searching biologically-related data space. More specifically, the invention includes methods of identifying bio-molecules with desired properties, or which are most suitable for acquiring such properties, from complex bio-molecule libraries or sets of such libraries. The invention also provides methods of modeling sequence-activity relationships. As many of the methods are computer-implemented, the invention additionally provides digital systems and software for performing these methods.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 11/429,628, filed May 5, 2006, naming Gustafsson et al. as inventors, and titled “Methods, Systems, and Software for Identifying Functional Biomolecules” which is a divisional of U.S. patent application Ser. No. 10/629,351, filed Jul. 29, 2003, naming Gustafsson et al. as inventors, and titled “Methods, Systems, and Software for Identifying Functional Bio-Molecules” which is a continuation in part of U.S. patent application Ser. No. 10/379,378, filed Mar. 3, 2003, naming Gustafsson et al. as inventors, and titled “Methods, Systems, and Software for Identifying Functional Bio-Molecules.” U.S. patent application Ser. No. 10/629,351 is also a continuation in part of International Application No. PCT/US03/06551 filed Mar. 3, 2003, naming Gustafsson et al. as inventors. Both U.S. patent application Ser. No. 10/379,378 and International Application No. PCT/US03/06551 claim the benefit under 35 U.S.C. §119(e) of U.S. Ser. No. 60/360,982, filed Mar. 1, 2002. Each of the above documents is incorporated herein by reference in their entireties and for all purposes.

COPYRIGHT NOTIFICATION

A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.

FIELD OF THE INVENTION

The present invention relates to the fields of molecular biology, molecular evolution, bioinformatics, and digital systems. More specifically, the invention relates to methods of identifying biomolecule targets with desired properties and methods for computationally predicting the activity of a biomolecule. Systems, including digital systems, and system software for performing these methods are also provided. Methods of the present invention have utility in the optimization of proteins for industrial and therapeutic use.

BACKGROUND

Protein design has long been known to be a difficult task if for no other reason than the combinatorial explosion of possible molecules that constitute searchable sequence space. The protein design problem was recently shown to belong to a class of problems known as NP-hard (Pierce, et al. (2002) “Protein Design is NP-hard,” Prot. Eng. 15(10):779-782), indicating that there is no algorithm known that can solve such problems in polynomial time. Because of this complexity, many approximate methods have been used to design better proteins; chief among them is the method of directed evolution. Directed evolution of proteins is today dominated by various high throughput screening and recombination formats, often performed iteratively.

Sequence space can be described as a space where all possible protein neighbors can be obtained by a series of single point mutations. Smith (1970) “Natural selection and the concept of a protein space,” Nature, 225(232):563-4. For example, a 100 residue long protein would be a 100 dimensional object with 20 possible values, i.e., the 20 naturally occurring amino acids, in each dimension. Each one of these proteins has a corresponding fitness on some complex landscape. Models of such “fitness landscapes” were first studied by Sewall Wright (Wright (1932) “The roles of mutation, inbreeding, crossbreeding and selection in evolution,” Proceedings of 6th International Conference on Genetics, 1:356-366) but have since been expanded on by others (Eigen, M. (1971) “Self organization of matter and the evolution of biological macromolecules,” Naturwissenschaften, 58(10):465-523; Kauffman, S. et al. (1987) “Towards a general theory of adaptive walks on rugged landscapes,” J. Theor. Biol., 128(1):11-45; Kauffman, E. S., et al. (1989) “The NK model of rugged fitness landscapes and its application to maturation of the immune response,” J. Theor. Biol., 141(2):211-45; Schuster, P., et al. (1994) “Landscapes: complex optimization problems and biopolymer structures,” Comput. Chem., 18(3):295-324; Govindarajan, S. et al. (1997) “Evolution of model proteins on a foldability landscape,” Proteins, 29(4):461-6). The sequence space of proteins is immense and is impossible to explore exhaustively. Accordingly, new ways to efficiently search sequence space to identify functional proteins would be highly desirable.
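To make the scale of the 100-residue example concrete, the following minimal Python sketch counts the sequences in that space and the single-point-mutation neighbors of any one sequence. The function names are illustrative assumptions and do not come from the specification.

```python
# Illustrative arithmetic only; function names are hypothetical.
ALPHABET_SIZE = 20  # the 20 naturally occurring amino acids


def sequence_space_size(length, alphabet_size=ALPHABET_SIZE):
    """Number of distinct sequences of the given length."""
    return alphabet_size ** length


def single_mutant_neighbors(length, alphabet_size=ALPHABET_SIZE):
    """Number of sequences reachable by exactly one point mutation."""
    return length * (alphabet_size - 1)


size = sequence_space_size(100)
print(len(str(size)))                # decimal digits in 20**100
print(single_mutant_neighbors(100))  # one-step neighbors of any sequence
```

Each sequence has only 1900 immediate neighbors, yet the full space has a size on the order of 10^130, which is why exhaustive exploration is not feasible.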

SUMMARY

One aspect of the present invention pertains to methods, apparatus, and software for identifying amino acid residues for variation in a protein variant library. These residues are then varied in the sequences of protein variants in the library in order to affect a desired activity such as stability, catalytic activity, therapeutic activity, resistance to a pathogen or toxin, toxicity, etc. The method of this aspect may be described by the following sequence of operations: (a) receiving data characterizing a training set of a protein variant library; (b) from the data, developing a sequence activity model that predicts activity as a function of amino acid residue type and corresponding position in the sequence; and (c) using the sequence activity model to identify one or more amino acid residues at specific positions in the systematically varied sequences that are to be varied in order to impact the desired activity. In this method, the protein variants in the library may have systematically varied sequences. Further, the data provides activity and sequence information for each protein variant in the training set.

In some embodiments, the method also includes (d) using the sequence activity model to identify one or more amino acid residues that are to remain fixed (as opposed to being varied) in a new protein variant library.

The protein variant library may include proteins from various sources. In one example, the members include naturally occurring proteins such as those encoded by members of a single gene family. In another example, the members include proteins obtained by using a recombination-based diversity generation mechanism. Classical DNA shuffling (i.e., DNA fragmentation-mediated recombination) or synthetic DNA shuffling (i.e., synthetic oligonucleotide-mediated recombination) may be performed on nucleic acids encoding all or part of one or more naturally occurring parent proteins for this purpose. In still another example, the members are obtained by performing DOE to identify the systematically varied sequences.

Generally, the sequence activity model may be of any form that does a good job of predicting activity from sequence information. In a preferred embodiment, the model is a regression model such as a partial least squares model or a principal component regression model. In another example, the model is a neural network.

Using the sequence activity model to identify residues for fixing or variation may involve any of many different possible analytical techniques. In some cases, a “reference sequence” is used to define the variations. Such a sequence may be one predicted by the model to have the highest value (or one of the highest values) of the desired activity. In another case, the reference sequence may be that of a member of the original protein variant library. From the reference sequence, the method may select subsequences for effecting the variations. In addition or alternatively, the sequence activity model ranks residue positions (or specific residues at certain positions) in order of impact on the desired activity.

One goal of the method may be to generate a new protein variant library. As part of this process, the method may identify sequences that are to be used for generating this new library. Such sequences include variations on the residues identified in (c) above or are precursors used to subsequently introduce such variations. The sequences may be modified by performing mutagenesis or a recombination-based diversity generation mechanism to generate the new library of protein variants. This may form part of a directed evolution procedure. The new library may also be used in developing a new sequence activity model.

In some embodiments, the method involves selecting one or more members of the new protein variant library for production. One or more of these may then be synthesized and/or expressed in an expression system.

Another aspect of the invention pertains to methods for defining a library of biological molecules. Such methods may be characterized by the following sequence of operations: (a) receiving an original set of data points representing the activity and sequence of multiple biological molecules in a training set; (b) constructing a bootstrap set of data points selected, with replacement, from the original set of data points; (c) generating a model from the bootstrap set, which model comprises indicators of the relative importance of individual residues or other units in biological molecules represented by the data points in the bootstrap set; (d) repeating (b) and (c) multiple times to generate multiple values of each indicator from the models generated in (c); (e) for each indicator, determining (i) an average or mean value of the multiple values and (ii) a statistical indication of the distribution of the multiple values; (f) ranking the individual residues or other units on the basis of their respective values of (i) and (ii) determined in (e); and (g) toggling particular ones of the individual residues or other units based on rankings produced in (f) to thereby define the library of biological molecules.
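Operations (a) through (g) above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the claimed implementation: the "model" is a simple difference of mean activities per residue (standing in for regression coefficients), the spread is measured with a sample standard deviation, and the ranking score is the absolute mean divided by the spread. All function names and the toy data are hypothetical.

```python
import random
import statistics


def fit_indicators(data):
    """Stand-in model (assumption): for each (position, residue), the mean
    activity of variants carrying it minus the mean of variants lacking it."""
    positions = range(len(data[0][0]))
    features = sorted({(i, seq[i]) for seq, _ in data for i in positions})
    indicators = {}
    for pos, aa in features:
        with_aa = [y for seq, y in data if seq[pos] == aa]
        without = [y for seq, y in data if seq[pos] != aa]
        if without:  # skip residues that do not vary in this sample
            indicators[(pos, aa)] = statistics.mean(with_aa) - statistics.mean(without)
    return indicators


def bootstrap_rank(data, n_rounds=200, seed=0):
    rng = random.Random(seed)
    values = {}
    for _ in range(n_rounds):                        # (d) repeat (b) and (c)
        sample = rng.choices(data, k=len(data))      # (b) draw with replacement
        for key, v in fit_indicators(sample).items():  # (c) model indicators
            values.setdefault(key, []).append(v)
    summary = []
    for key, vs in values.items():                   # (e) mean and spread
        if len(vs) >= 2:
            summary.append((key, statistics.mean(vs), statistics.stdev(vs)))
    # (f) rank by |mean| relative to spread (an assumed scoring rule)
    summary.sort(key=lambda t: abs(t[1]) / (t[2] + 1e-12), reverse=True)
    return summary


def positions_to_toggle(summary, k):
    """(g) the first k distinct positions in rank order."""
    chosen = []
    for (pos, _aa), _mean, _spread in summary:
        if pos not in chosen:
            chosen.append(pos)
        if len(chosen) == k:
            break
    return chosen


# (a) hypothetical training set: (sequence, measured activity)
TRAINING = [("AVK", 10.0), ("ALK", 10.2), ("GVK", 5.1), ("GLK", 4.9),
            ("AVK", 9.8), ("ALK", 10.1), ("GVK", 5.0), ("GLK", 5.2)]
ranked = bootstrap_rank(TRAINING)
print(positions_to_toggle(ranked, 1))
```

In the toy data, position 0 has a large and consistent effect across bootstrap samples, position 1 has a small and unstable one, and position 2 never varies, so it never receives an indicator.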

Yet another aspect of the invention pertains to apparatus and computer program products including machine-readable media on which are provided program instructions and/or arrangements of data for implementing the methods and software systems described above. Frequently, the program instructions are provided as code for performing certain method operations. Data, if employed to implement features of this invention, may be provided as data structures, database tables, data objects, or other appropriate arrangements of specified information. Any of the methods or systems of this invention may be represented, in whole or in part, as such program instructions and/or data provided on machine-readable media.

These and other features of the present invention will be described in more detail below in the detailed description of the invention and in conjunction with the following figures.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1A is a flow chart depicting a sequence of operations, including identifying particular residues for variation, that may be used to generate one or more generations of protein variant libraries.

FIG. 1B is a flow chart depicting a bootstrap p-value method of generating protein variant libraries in accordance with an embodiment of this invention.

FIG. 2 is a graph that illustrates a convex Pareto front in a plot of a hypothetical set of data.

FIG. 3 is a graph that illustrates a non-convex Pareto front in a plot of a hypothetical set of data.

FIG. 4 is a chart that depicts certain steps performed in one embodiment of a method of identifying members of a population of biopolymer sequence variants most suitable for artificial evolution.

FIG. 5 is a chart that depicts certain steps performed in one embodiment of a method of identifying members of a set of biopolymer character string variants that include multiple improved objectives relative to other members of the set of biopolymer character string variants.

FIG. 6 is a chart that depicts steps performed in one embodiment of a method of evolving libraries for directed evolution.

FIG. 7 is a chart that depicts certain steps performed in an embodiment of a method of producing a fitter population of character string libraries.

FIG. 8 is a chart that shows certain steps performed in an embodiment of a method of selecting amino acid positions in a polypeptide variant to artificially evolve.

FIG. 9 is a chart that shows certain steps performed in another embodiment of a method of selecting amino acid positions in a polypeptide variant to artificially evolve.

FIG. 10 is a chart that shows certain steps performed in an embodiment of a method of identifying amino acids in polypeptides that are important for a polypeptide sequence-activity relationship.

FIG. 11 is a chart that depicts certain steps performed in one embodiment of a method for efficiently searching sequence space.

FIG. 12 is a chart that illustrates certain steps performed in one embodiment of a method for efficiently searching sequence space.

FIG. 13 is a chart that shows certain steps performed in an embodiment of a method of predicting character strings that include desired properties.

FIG. 14 schematically illustrates an example organizational tree according to one embodiment of the invention.

FIG. 15 is a chart that depicts certain steps performed in one embodiment of a method of predicting properties of target polypeptide character strings.

FIG. 16 is a schematic of an example digital device.

DETAILED DISCUSSION OF THE INVENTION

I. DEFINITIONS

Before describing the present invention in detail, it is to be understood that this invention is not limited to particular compositions or systems, which can, of course, vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting. As used in this specification and appended claims, the singular forms “a”, “an”, and “the” include plural referents unless the content and context clearly dictate otherwise. Thus, for example, reference to “a device” includes a combination of two or more such devices, and the like. Unless indicated otherwise, an “or” conjunction is intended to be used in its correct sense as a Boolean logical operator, encompassing both the selection of features in the alternative (A or B, where the selection of A is mutually exclusive from B) and the selection of features in conjunction (A or B, where both A and B are selected).

The following definitions and those included throughout this disclosure supplement those known to persons of skill in the art.

A “bio-molecule” refers to a molecule that is generally found in a biological organism. Preferred biological molecules include biological macromolecules that are typically polymeric in nature, being composed of multiple subunits (i.e., “biopolymers”). Typical bio-molecules include, but are not limited to, molecules that share some structural features with naturally occurring polymers such as RNAs (formed from nucleotide subunits), DNAs (formed from nucleotide subunits), and polypeptides (formed from amino acid subunits), including, e.g., RNAs, RNA analogues, DNAs, DNA analogues, polypeptides, polypeptide analogues, peptide nucleic acids (PNAs), combinations of RNA and DNA (e.g., chimeraplasts), or the like. Bio-molecules also include, e.g., lipids, carbohydrates, or other organic molecules that are made by one or more genetically encodable molecules (e.g., one or more enzymes or enzyme pathways) or the like.

The term “nucleic acid” refers to deoxyribonucleotides or ribonucleotides and polymers (e.g., oligonucleotides, polynucleotides, etc.) thereof in either single- or double-stranded form. Unless specifically limited, the term encompasses nucleic acids containing known analogs of natural nucleotides which have similar binding properties as the reference nucleic acid and are metabolized in a manner similar to naturally occurring nucleotides. Unless otherwise indicated, a particular nucleic acid sequence also implicitly encompasses conservatively modified variants thereof (e.g., degenerate codon substitutions) and complementary sequences, as well as the sequence explicitly indicated. Specifically, degenerate codon substitutions may be achieved by generating sequences in which the third position of one or more selected (or all) codons is substituted with mixed-base and/or deoxyinosine residues (Batzer et al. (1991) Nucleic Acid Res. 19:5081; Ohtsuka et al. (1985) J. Biol. Chem. 260:2605-2608; Rossolini et al. (1994) Mol. Cell. Probes 8:91-98). The term nucleic acid is used interchangeably with, e.g., oligonucleotide, polynucleotide, gene, cDNA, and mRNA encoded by a gene.

A “nucleic acid sequence” refers to the order and identity of the nucleotides comprising a nucleic acid.

A “polynucleotide” is a polymer of nucleotides (A, C, T, U, G, etc. or naturally occurring or artificial nucleotide analogues) or a character string representing a polymer of nucleotides, depending on context. Either the given nucleic acid or the complementary nucleic acid can be determined from any specified polynucleotide sequence.

The term “gene” is used broadly to refer to any segment of DNA associated with a biological function. Thus, genes include coding sequences and, optionally, the regulatory sequences required for their expression. Genes also optionally include nonexpressed DNA segments that, for example, form recognition sequences for other proteins. Genes can be obtained from a variety of sources, including cloning from a source of interest or synthesizing from known or predicted sequence information, and may include sequences designed to have desired parameters.

Two nucleic acids are “recombined” when sequences from each of the two nucleic acids are combined in a progeny nucleic acid. Two sequences are “directly” recombined when both of the nucleic acids are substrates for recombination.

The terms “polypeptide” and “protein” are used interchangeably herein to refer to a polymer of amino acid residues. Typically, the polymer has at least about 30 amino acid residues, and usually at least about 50 amino acid residues. More typically, it contains at least about 100 amino acid residues. The terms apply to amino acid polymers in which one or more amino acid residues are analogs, derivatives or mimetics of corresponding naturally occurring amino acids, as well as to naturally occurring amino acid polymers. For example, polypeptides can be modified or derivatized, e.g., by the addition of carbohydrate residues to form glycoproteins. The terms “polypeptide” and “protein” include glycoproteins, as well as non-glycoproteins.

A “motif” refers to a pattern of subunits in or among biological molecules. For example, the motif can refer to a subunit pattern of the unencoded biological molecule or to a subunit pattern of an encoded representation of a biological molecule.

“Screening” refers to the process in which one or more properties of one or more bio-molecules are determined. For example, typical screening processes include those in which one or more properties of one or more members of one or more libraries is/are determined.

“Selection” refers to the process in which one or more bio-molecules are identified as having one or more properties of interest. Thus, for example, one can screen a library to determine one or more properties of one or more library members. If one or more of the library members is/are identified as possessing a property of interest, it is selected. Selection can include the isolation of a library member, but this is not necessary. Further, selection and screening can be, and often are, simultaneous.

The term “covariation” refers to the correlated variation of two or more variables (e.g., amino acids in a polypeptide, etc.).

“Genetic algorithms” are processes which mimic evolutionary processes. Genetic algorithms (GAs) are used in a wide variety of fields to solve problems which are not fully characterized or too complex to allow full characterization, but for which some analytical evaluation is available. That is, GAs are used to solve problems which can be evaluated by some quantifiable measure for the relative value of a solution (or at least the relative value of one potential solution in comparison to another). In the context of the present invention, a genetic algorithm is a process for selecting or manipulating character strings in a computer, typically where the character string corresponds to one or more biological molecules (e.g., nucleic acids, proteins, PNAs, or the like).

“Directed evolution” or “artificial evolution” refers to a process of artificially changing a character string by artificial selection, recombination, or other manipulation, i.e., which occurs in a reproductive population in which there are (1) varieties of individuals, with some varieties being (2) heritable, of which some varieties (3) differ in fitness (reproductive success determined by outcome of selection for a predetermined property (desired characteristic)). The reproductive population can be, e.g., a physical population or a virtual population in a computer system.

“Genetic operators” are user-defined operations, or sets of operations, each including a set of logical instructions for manipulating character strings. Genetic operators are applied to cause changes in populations of individuals in order to find interesting (useful) regions of the search space (populations of individuals with predetermined desired properties) by predetermined means of selection. Predetermined (or partially predetermined) means of selection include computational tools (operators comprising logical steps guided by analysis of information describing libraries of character strings), and physical tools for analysis of physical properties of physical objects, which can be built (synthesized) from matter with the purpose of physically creating a representation of information describing libraries of character strings. In a preferred embodiment, some or all of the logical operations are performed in a digital system.

When referring to operations on strings (e.g., recombinations, hybridizations, elongations, fragmentations, segmentations, insertions, deletions, transformations, etc.) it will be appreciated that the operation can be performed on the encoded representation of a biological molecule or on the “molecule” prior to encoding so that the encoded representation captures the operation.

A “data structure” refers to the organization and optionally associated device for the storage of information, typically multiple “pieces” of information. The data structure can be a simple recordation of the information (e.g., a list) or the data structure can contain additional information (e.g., annotations) regarding the information contained therein, can establish relationships between the various “members” (i.e., information “pieces”) of the data structure, and can provide pointers or links to resources external to the data structure. The data structure can be intangible but is rendered tangible when stored or represented in a tangible medium (e.g., paper, computer readable medium, etc.). The data structure can represent various information architectures including, but not limited to, simple lists, linked lists, indexed lists, data tables, indexes, hash indices, flat file databases, relational databases, local databases, distributed databases, thin client databases, and the like. In preferred embodiments, the data structure provides fields sufficient for the storage of one or more character strings. The data structure is optionally organized to permit alignment of the character strings and, optionally, to store information regarding the alignment and/or string similarities and/or string differences. In one embodiment, this information is in the form of alignment “scores” (e.g., similarity indices) and/or alignment maps showing individual subunit (e.g., nucleotide in the case of nucleic acid) alignments. The term “encoded character string” refers to a representation of a biological molecule that preserves desired sequence/structural information regarding that molecule. As noted throughout, non-sequence properties of bio-molecules can be stored in a data structure, and alignments of such non-sequence properties, in a manner analogous to sequence based alignment, can be practiced.

It is generally assumed that two nucleic acids have common ancestry when they demonstrate sequence similarity. However, the exact level of sequence similarity necessary to establish homology varies in the art. In general, for purposes of this disclosure, two nucleic acid sequences are deemed to be homologous when they share enough sequence identity to permit direct recombination to occur between the two sequences.

A “phylogenetic family” refers to organisms, nucleic acid sequences, polypeptide sequences, or the like that share a common evolutionary relationship or lineage pattern.

A “subsequence” or “fragment” is any portion of an entire sequence of nucleic acids or amino acids.

A “library” or “population” refers to a collection of at least two different molecules and/or character strings, such as nucleic acid sequences (e.g., genes, oligonucleotides, etc.) or expression products (e.g., enzymes) therefrom. A library or population generally includes a number of different molecules. For example, a library or population typically includes at least about 10 different molecules. Large libraries typically include at least about 100 different molecules, more typically at least about 1000 different molecules. For some applications, the library includes at least about 10000 or more different molecules.

“Classification And Regression Trees” or “CART” refers to a classification tree program that uses an exhaustive grid search of all possible univariate splits to find the splits for a classification tree.
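As an illustration of what an exhaustive search over univariate splits involves, the sketch below tries every feature and every midpoint between observed values, scoring each candidate split by weighted Gini impurity. The impurity measure, the function names, and the toy data are assumptions made for illustration; the definition above does not specify them.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_univariate_split(rows, labels):
    """Exhaustively score every (feature, threshold) pair.

    Returns (weighted impurity, feature index, threshold) for the best split,
    or None if no feature takes more than one value.
    """
    n = len(rows)
    best = None
    for j in range(len(rows[0])):
        values = sorted({r[j] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # candidate threshold between observed values
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, j, t)
    return best


# Hypothetical binary descriptors for four variants and their activity class.
ROWS = [[0, 1], [0, 0], [1, 1], [1, 0]]
LABELS = ["low", "low", "high", "high"]
print(best_univariate_split(ROWS, LABELS))
```

Here the first descriptor separates the two classes perfectly, so the search selects it.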

“Systematic variance” refers to different descriptors of an item or set of items being changed in different combinations.

“Systematically varied data” refers to data produced, derived, or resulting from different descriptors of an item or set of items being changed in different combinations. Many different descriptors can be changed at the same time, but in different combinations. For example, activity data gathered from polypeptides in which combinations of amino acids have been changed is systematically varied data.

A “descriptor” refers to something that serves to describe or identify an item. For example, characters in a character string can be descriptors of amino acids in a polypeptide being represented by the character string.
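One common way to turn character descriptors into numeric inputs for a model is binary (one-hot) encoding, sketched below. This encoding is offered as an assumption for illustration; the definition above does not prescribe any particular numeric representation, and the function names are hypothetical.

```python
def descriptor_columns(sequences):
    """(position, residue) pairs for every position that varies across the
    aligned, equal-length sequences."""
    columns = []
    for pos in range(len(sequences[0])):
        residues = sorted({seq[pos] for seq in sequences})
        if len(residues) > 1:  # constant positions carry no information
            columns.extend((pos, aa) for aa in residues)
    return columns


def encode(seq, columns):
    """Binary vector: 1 where the sequence has that residue at that position."""
    return [1 if seq[pos] == aa else 0 for pos, aa in columns]


# Hypothetical systematically varied set: positions 0 and 1 change in
# different combinations while position 2 stays fixed.
VARIANTS = ["AVK", "ALK", "GVK"]
COLUMNS = descriptor_columns(VARIANTS)
print(COLUMNS)
print(encode("GVK", COLUMNS))
```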

A “hyperbox” refers to a selected region in the objective space (e.g., sequence space) that includes at least one individual (e.g., a scored bio-molecule or character string representation of the bio-molecule) that lies at least proximate to a Pareto front in a given set of data.
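The sketch below shows one way to find the Pareto front of a set of scored individuals and then collect the individuals inside a box around a chosen front member. It assumes every objective is to be maximized and that the box is axis-aligned with a fixed half-width; both choices, along with the function names and data, are illustrative assumptions rather than details taken from this disclosure.

```python
def dominates(a, b):
    """True if a is at least as good as b in every objective and strictly
    better in at least one (all objectives maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def hyperbox_members(points, center, half_width):
    """Points inside an axis-aligned box around `center` (assumed box shape)."""
    return [p for p in points
            if all(abs(x - c) <= half_width for x, c in zip(p, center))]


# Hypothetical scores for two objectives, e.g., (activity, stability).
SCORES = [(1, 5), (2, 4), (3, 3), (2, 2), (1, 1), (3, 1)]
FRONT = pareto_front(SCORES)
print(FRONT)
print(hyperbox_members(SCORES, center=(2, 4), half_width=1))
```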

The terms “sequence” and “character strings” are used interchangeably herein to refer to the order and identity of amino acid residues in a protein (i.e., a protein sequence or protein character string) or to the order and identity of nucleotides in a nucleic acid (i.e., a nucleic acid sequence or nucleic acid character string).

II. GENERATING IMPROVED PROTEIN VARIANT LIBRARIES

In accordance with the present invention, various methods are provided for generating new protein variant libraries that can be used to explore protein sequence and activity space. A feature of many such methods is a procedure for identifying amino acid residues in a protein sequence that are predicted to impact a desired activity. As one example, such procedure includes the following operations:

(a) receiving data characterizing a training set of protein variants, wherein the data provides activity and sequence information for each protein variant in the training set;

(b) from the data, developing a sequence activity model that predicts activity as a function of amino acid residue type and corresponding position in the sequence; and

(c) using the sequence activity model to identify one or more amino acid residues at specific positions in one or more protein variants that are to be varied in order to impact the desired activity.

Other methods including slight variations of this method are within the scope of the present invention as set forth herein.
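Operations (a) through (c) can be sketched end to end in plain Python. The sketch assumes an additive model fitted by ordinary least squares (a form of multiple linear regression), solved through the normal equations; the disclosure also contemplates PLS and PCR, which are not shown here. The reference-sequence dummy coding, the function names, and the toy activities are all hypothetical, and the solver assumes the design matrix has full rank.

```python
def build_columns(seqs, reference):
    """One dummy variable per non-reference residue seen at each position."""
    cols = []
    for pos in range(len(reference)):
        for aa in sorted({s[pos] for s in seqs} - {reference[pos]}):
            cols.append((pos, aa))
    return cols


def encode(seq, cols):
    """Leading 1.0 is the intercept term."""
    return [1.0] + [1.0 if seq[p] == aa else 0.0 for p, aa in cols]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for i in range(n):
        pivot = max(range(i, n), key=lambda r: abs(m[r][i]))
        m[i], m[pivot] = m[pivot], m[i]
        for r in range(i + 1, n):
            f = m[r][i] / m[i][i]
            for c in range(i, n + 1):
                m[r][c] -= f * m[i][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][c] * x[c] for c in range(i + 1, n))) / m[i][i]
    return x


def fit_linear_model(training, reference):
    """(b) least-squares fit; returns (intercept, {(position, residue): coef})."""
    seqs = [s for s, _ in training]
    cols = build_columns(seqs, reference)
    X = [encode(s, cols) for s in seqs]
    y = [act for _, act in training]
    k = len(cols) + 1
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(X, y)) for i in range(k)]
    coef = solve(xtx, xty)
    return coef[0], dict(zip(cols, coef[1:]))


def predict(seq, intercept, coefs):
    return intercept + sum(c for (p, aa), c in coefs.items() if seq[p] == aa)


def rank_residues(coefs):
    """(c) residues ordered by the magnitude of their fitted effect."""
    return sorted(coefs, key=lambda key: abs(coefs[key]), reverse=True)


# (a) hypothetical training set: (sequence, measured activity)
TRAINING = [("AVK", 10.0), ("GVK", 13.0), ("ALK", 9.0), ("GLK", 12.0),
            ("AVR", 10.5), ("GVR", 13.5), ("ALR", 9.5), ("GLR", 12.5)]
INTERCEPT, COEFS = fit_linear_model(TRAINING, reference="AVK")
print(rank_residues(COEFS))
```

Because the toy activities are exactly additive, the fit recovers each residue's contribution, and the ranking places the largest effect first.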

FIG. 1A presents a flow chart showing various operations that may be performed in the order depicted or in some other order. As shown, a process 01 begins at a block 03 with receipt of data describing a training set comprising residue sequences for a protein variant library. In other words, the training set data is derived from a protein variant library. Typically that data will include, for each protein in the library, a complete or partial residue sequence together with an activity value. In some cases, multiple types of activities (e.g., rate constant and thermal stability) are provided together in the training set.

In many embodiments, the individual members of the protein variant library represent a wide range of sequences and activities. This allows one to generate a sequence-activity model having applicability over a broad region of sequence space. Techniques for generating such diverse libraries include systematic variation of protein sequences and directed evolution techniques. Both of these are described in more detail elsewhere herein.

Activity data may be obtained by assays or screens appropriately designed to measure activity magnitudes. Such techniques are well known and are not central to this invention. The principles for designing appropriate assays or screens are widely understood. Techniques for obtaining protein sequences are also well known and are not central to this invention. The activity used with this invention may be protein stability (e.g., thermal stability). However, many important embodiments consider other activities such as catalytic activity, resistance to pathogens and/or toxins, therapeutic activity, toxicity, and the like.

After the training set data has been generated or acquired, the process uses it to generate a sequence-activity model that predicts activity as a function of sequence information. See block 05. Such model is an expression, algorithm or other tool that predicts the relative activity of a particular protein when provided with sequence information for that protein. In other words, protein sequence information is an input and activity prediction is an output. For many embodiments of this invention, the model can also rank the contribution of various residues to activity. Methods of generating such models (e.g., partial least squares regression (PLS), principal component regression (PCR), and multiple linear regression (MLR)) will be discussed below, along with the format of the independent variables (sequence information), the format of the dependent variable(s) (activity), and the form of the model itself (e.g., a linear first order expression).
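As a hedged illustration of what a linear first order expression might look like, one common additive form is given below. The symbols are assumptions introduced here for illustration: y is the predicted activity, N is the number of variable positions, M is the number of residue types considered at each position, c_0 is an intercept, c_ij is a fitted coefficient, and x_ij is a binary indicator.

```latex
y = c_0 + \sum_{i=1}^{N} \sum_{j=1}^{M} c_{ij}\, x_{ij},
\qquad
x_{ij} =
\begin{cases}
1 & \text{if residue type } j \text{ occurs at position } i,\\
0 & \text{otherwise.}
\end{cases}
```

In such a form each coefficient c_ij expresses the contribution of one residue at one position, which is what allows residues to be ranked by their predicted effect on activity. The expression contains no cross-terms, so it does not capture interactions between positions.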

A model generated at block 05 is employed to identify multiple residue positions (e.g., position 35) or specific residue values (e.g., glutamine at position 35) that are predicted to impact activity. See block 07. In addition to identifying such positions, the model may “rank” the residue positions or residue values based on their contributions to activity. For example, the model may predict that glutamine at position 35 has the most pronounced effect on activity, phenylalanine at position 208 has the second most pronounced effect, and so on. In a specific approach described below, PLS or PCR regression coefficients are employed to rank the importance of specific residues. In another specific approach, a PLS load matrix is employed to rank the importance of specific residue positions.

After the process has identified residues that impact activity, some of them are selected for variation as indicated at a block 09. This is done for the purpose of exploring sequence space. Residues are selected using any of a number of different selection protocols, some of which will be described below. In one example, specific residues predicted to have the biggest beneficial impact on activity are preserved; in other words, they are not varied. A certain number of other residues predicted to have a lesser impact are, however, selected for variation. In another example, the residue positions found to have the biggest impact on activity are selected, but only if they are found to vary in high performing members of the training set. For example, if the model predicts that residue position 197 has the biggest impact on activity, but all or most of the proteins with high activity have leucine at this position, then position 197 would not be selected for variation in this approach. All proteins in a next generation library would have leucine at position 197. However, if some “good” proteins had valine at this position and others had leucine, then the process would choose to vary the amino acid at this position.
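The second selection example above (vary a high-impact position only if it varies among high performers) can be sketched as follows. The cutoff for "high performing" (`top_fraction`), the cutoff for "all or most" (`fix_threshold`), the function name, and the toy data are hypothetical parameters chosen for illustration.

```python
from collections import Counter


def select_positions(ranked_positions, training, top_fraction=0.5,
                     fix_threshold=0.8):
    """Split ranked positions into those to vary and those to hold fixed.

    A position is fixed when its most common residue among the top performers
    reaches `fix_threshold` (an assumed reading of "all or most").
    """
    n_top = max(1, int(len(training) * top_fraction))
    top = sorted(training, key=lambda t: t[1], reverse=True)[:n_top]
    vary, fixed = [], {}
    for pos in ranked_positions:
        counts = Counter(seq[pos] for seq, _ in top)
        residue, count = counts.most_common(1)[0]
        if count / len(top) >= fix_threshold:
            fixed[pos] = residue  # e.g., leucine retained at this position
        else:
            vary.append(pos)
    return vary, fixed


# Hypothetical training set: position 0 is always L among the best variants.
TRAINING = [("LVA", 9.5), ("LIA", 9.0), ("LVG", 8.8), ("LIG", 8.7),
            ("MVA", 2.0), ("MIA", 1.5), ("MVG", 1.0), ("MIG", 0.5)]
VARY, FIXED = select_positions([0, 1, 2], TRAINING)
print(VARY, FIXED)
```

Position 0 has the largest effect on activity in this toy set, yet it is held fixed because every high performer already carries leucine there.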

After the residues for variation have been identified, the method next generates a new variant library having the specified residue variation. See block 11. Various methodologies are available for this purpose. In one example, an in vitro or in vivo recombination-based diversity generation mechanism is performed to generate the new variant library. Such procedures may employ oligonucleotides containing sequences or subsequences for encoding the proteins of the parental variant library. Some of the oligonucleotides will be closely related, differing only in the choice of codons for alternate amino acids selected for variation at 09. The recombination-based diversity generation mechanism may be performed for one or multiple cycles. If multiple cycles are used, each involves a screening step to identify which variants have acceptable performance to be used in a next recombination cycle. This is a form of directed evolution.

In a different example, a “reference” protein sequence is chosen and the residues selected at 09 are “toggled” to identify individual members of the variant library. The new proteins so identified are synthesized by an appropriate technique to generate the new library. In one example, the reference sequence may be a top-performing member of the training set or a “best” sequence predicted by a PLS or PCR model.

In another approach, the sequence activity model is used as a “fitness function” in a genetic algorithm for exploring sequence space. After one or more rounds of the genetic algorithm (with each round using the fitness function to select one or more possible sequences for a genetic operation), a next generation library is identified for use as described in this flow chart.
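As a rough sketch of the genetic algorithm approach, the following uses a small additive sequence-activity model as the fitness function. The positions, residue options, coefficient values, population size, and mutation rate are all hypothetical, and the particular selection, crossover, and mutation operators are just one simple choice among many.

```python
import random

# Hypothetical residue options and additive model coefficients.
OPTIONS = {35: ["Gln", "Glu"], 197: ["Leu", "Val"], 208: ["Phe", "Tyr"]}
COEFFS = {
    (35, "Gln"): 2.0, (35, "Glu"): 0.5,
    (197, "Leu"): 1.5, (197, "Val"): 1.0,
    (208, "Phe"): 0.2, (208, "Tyr"): 0.9,
}
POSITIONS = sorted(OPTIONS)


def fitness(seq):
    """Predicted activity: sum of coefficients for the residues present."""
    return sum(COEFFS[(pos, res)] for pos, res in zip(POSITIONS, seq))


def crossover(a, b, rng):
    """Single-point crossover between two parent sequences."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def mutate(seq, rng, rate=0.1):
    """Resample each residue from its allowed options with probability `rate`."""
    return tuple(
        rng.choice(OPTIONS[pos]) if rng.random() < rate else res
        for pos, res in zip(POSITIONS, seq)
    )


def run_ga(pop_size=8, rounds=5, seed=0):
    rng = random.Random(seed)
    population = [
        tuple(rng.choice(OPTIONS[p]) for p in POSITIONS) for _ in range(pop_size)
    ]
    for _ in range(rounds):
        # Selection: the fitter half become parents and are carried over.
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    return sorted(population, key=fitness, reverse=True)


next_library = run_ga()
print(next_library[0], fitness(next_library[0]))
```

The returned population plays the role of the next generation library; in practice its members would be synthesized and screened rather than trusted on the model prediction alone.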

After the new library has been produced, it is screened for activity, as indicated in block 13. Ideally, the new library will present one or more members with better activity than was observed in the previous library. However, even without such advantage, the new library can provide beneficial information. Its members may be employed for generating improved models that account for the effects of the variations selected in 09, and thereby more accurately predict activity across wider regions of sequence space. Further, the library may represent a passage in sequence space from a local maximum toward a global maximum (in activity).

Depending on the goal of process 01, it may be desirable to generate a series of new protein variant libraries, with each one providing new members of a training set. The updated training set is then used to generate an improved model. To this end, process 01 is shown with a decision operation 15, which determines whether yet another protein variant library should be produced. Various criteria can be used to make this decision. Examples include the number of protein variant libraries generated so far, the activity of top proteins from the current library, the magnitude of activity desired, and the level of improvement observed in recent new libraries.

Assuming that the process is to continue with a new library, the process returns to operation 05 where a new sequence-activity model is generated from sequence and activity data obtained for the current protein variant library. In other words, the sequence and activity data for the current protein variant library serves as part of the training set for the new model (or it may serve as the entire training set). Thereafter, operations 07, 09, 11, 13, and 15 are performed as described above, but with the new model.

At some point in process 01, this cycle will end and no new library will be generated. At that point, the process may simply terminate, or one or more sequences from one or more of the libraries may be selected for development and/or manufacture. See block 17.

A. Choosing Protein Variant Libraries

Protein variant libraries are groups of multiple proteins generated by methods of this invention. Protein variant libraries also provide the data for training sets used to generate sequence-activity models. The number of proteins included in a protein variant library depends on the application and the cost.

In one example, the protein variant library is generated from one or more naturally occurring proteins. In one example, these are protein members encoded by a single gene family. Other starting points for the library may be used. From these seed or starting proteins, the library may be generated by various techniques. In one case, the library is generated by classical DNA shuffling (i.e., DNA fragmentation-mediated recombination as described in Stemmer (1994) Proc. Natl. Acad. Sci. USA 91:10747-10751 and WO 95/22625) or synthetic DNA shuffling (i.e., synthetic oligonucleotide-mediated recombination as described in Ness et al. (2002) Nature Biotechnology 20:1251-1255 and WO 00/42561) on nucleic acids encoding part or all of one or more parent proteins. In another case, a single starting sequence is modified in various ways to generate the library. Preferably, the library is generated by systematically varying the individual residues. In one example, a design of experiment (DOE) methodology is employed to identify the systematically varied sequences. In another example, a “wet lab” procedure such as oligonucleotide-mediated recombination is used to introduce some level of systematic variation.

As used herein, the term “systematically varied sequences” refers to a set of sequences in which each residue is seen in multiple contexts. In principle, the level of systematic variation can be quantified by the degree to which the sequences are orthogonal to one another (maximally different compared to the mean). In practice, the process does not depend on having maximally orthogonal sequences; however, the quality of the model will be improved in direct relation to the orthogonality of the sequence space tested. In a simple example, a peptide sequence is systematically varied by identifying two residue positions, each of which can have one of two different amino acids. A maximally diverse library includes all four possible sequences. The size of such a maximally varied library increases exponentially with the number of variable positions; e.g., as 2^(N) when there are 2 options at each of N residue positions. Those having ordinary skill in the art will readily recognize, however, that maximal systematic variation is not required by the invention methods. Systematic variation provides a mechanism for identifying a relatively small set of sequences for testing that provides a good sampling of sequence space.
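The two-position example and the 2^(N) growth can be checked with a short enumeration. This is only an illustrative sketch; the position numbers and residue choices below are hypothetical.

```python
from itertools import product


def full_library(options):
    """Enumerate every combination of residue choices (maximal variation)."""
    positions = sorted(options)
    return [
        dict(zip(positions, combo))
        for combo in product(*(options[p] for p in positions))
    ]


# Two positions, two amino acids each: all four possible sequences.
two_positions = {12: ["Ala", "Gly"], 47: ["Ser", "Thr"]}
library = full_library(two_positions)
print(len(library))

# With 2 options at each of N positions the library has 2**N members.
ten_positions = {i: ["Ala", "Gly"] for i in range(10)}
print(len(full_library(ten_positions)) == 2 ** 10)
```

The exponential growth is the reason a smaller, systematically varied subset is usually tested instead of the full enumeration.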

Protein variants having systematically varied sequences can be obtained in a number of ways using techniques that are well known to those having ordinary skill in the art. Suitable methods include recombination-based methods that generate variants based on one or more “parental” polynucleotide sequences. Polynucleotide sequences can be recombined using a variety of techniques, including, for example, DNAse digestion of polynucleotides to be recombined followed by ligation and/or PCR reassembly of the nucleic acids. These methods include those described in, for example, Stemmer (1994) Proc. Natl. Acad. Sci. USA, 91:10747-10751, U.S. Pat. No. 5,605,793, “Methods for In Vitro Recombination,” U.S. Pat. No. 5,811,238, “Methods for Generating Polynucleotides having Desired Characteristics by Iterative Selection and Recombination,” U.S. Pat. No. 5,830,721, “DNA Mutagenesis by Random Fragmentation and Reassembly,” U.S. Pat. No. 5,834,252, “End Complementary Polymerase Reaction,” U.S. Pat. No. 5,837,458, “Methods and Compositions for Cellular and Metabolic Engineering,” WO/42832, “Recombination of Polynucleotide Sequences Using Random or Defined Primers,” WO 98/27230, “Methods and Compositions for Polypeptide Engineering,” WO 99/29902, “Method for Creating Polynucleotide and Polypeptide Sequences,” and the like.

Synthetic recombination methods are also particularly well suited for generating protein variant libraries with systematic variation. In synthetic recombination methods, a plurality of oligonucleotides are synthesized which collectively encode a plurality of the genes to be recombined. Typically the oligonucleotides collectively encode sequences derived from homologous parental genes. For example, homologous genes of interest are aligned using a sequence alignment program such as BLAST (Altschul et al., J. Mol. Biol., 215:403-410 (1990)). Nucleotides corresponding to amino acid variations between the homologues are noted. These variations are optionally further restricted to a subset of the total possible variations based on covariation analysis of the parental sequences, functional information for the parental sequences, selection of conservative or non-conservative changes between the parental sequences, or other like criteria. Variations are optionally further increased to encode additional amino acid diversity at positions identified by, for example, covariation analysis of the parental sequences, functional information for the parental sequences, selection of conservative or non-conservative changes between the parental sequences, or apparent tolerance of a position for variation. The result is a degenerate gene sequence encoding a consensus amino acid sequence derived from the parental gene sequences, with degenerate nucleotides at positions encoding amino acid variations. Oligonucleotides are designed which contain the nucleotides required to assemble the diversity present in the degenerate gene. Details regarding such approaches can be found in, for example, Ness et al. (2002), Nature Biotechnology 20:1251-1255, WO 00/42561, “Oligonucleotide Mediated Nucleic Acid Recombination,” WO 00/42560, “Methods for Making Character Strings, Polynucleotides and Polypeptides having Desired Characteristics,” WO 01/75767, “In Silico Cross-Over Site Selection,” and WO 01/64864, “Single-Stranded Nucleic Acid Template-Mediated Recombination and Nucleic Acid Fragment Isolation.”

The polynucleotide variant sequences are then transcribed and translated, either in vitro or in vivo, to create a set or library of protein variant sequences.

The set of systematically varied sequences can also be designed a priori using design of experiment (DOE) methods to define the sequences in the data set. A description of DOE methods can be found in Diamond, W. J. (2001) Practical Experiment Designs: for Engineers and Scientists, John Wiley & Sons and in “Practical Experimental Design for engineers and scientists” by William J Drummond (1981) Van Nostrand Reinhold Co New York, “Statistics for experimenters” George E. P. Box, William G Hunter and J. Stuart Hunter (1978) John Wiley and Sons, New York, or, e.g., on the world wide web at itl.nist.gov/div898/handbook/. There are several computational packages available to perform the relevant mathematics, including Statistics Toolbox (MatLab), JMP, Statistica, and Stat-Ease Design-Expert. The result is a systematically varied and orthogonally dispersed data set of sequences that is suitable for building the sequence activity model of the present invention. DOE-based data sets can be readily generated using either Plackett-Burman or Fractional Factorial designs. Id.

In engineering or chemical sciences, fractional factorial designs, for example, are used to define fewer experiments (than in full factorial designs) in which a factor is varied (toggled) between two or more levels. Optimization techniques are used to ensure that the experiments chosen are maximally informative in accounting for factor space variance. The same design approaches (e.g., fractional factorial, D-optimal design) can be applied in protein engineering to construct fewer sequences where a given number of positions are toggled between two or more residues. This set of sequences would be an optimal description of the systematic variance present in the protein sequence space in question. The corresponding molecules are then made (e.g., polynucleotides can be constructed via gene synthesis in accordance with a reverse translation of the sequence designs, then expressed as polypeptides). Once activities for these molecules are measured, a regression model, which tends to be an optimal solution, is developed. It should be mentioned that there is no restriction on the number of sequences to be constructed.

An example of the DOE approach applied to protein engineering includes the following operations:

1) Identify positions to toggle based on the principles described earlier (present in parental sequences, level of conservation, etc.).
2) Create a DOE experiment using one of the commonly available statistical packages by defining the number of factors (variable positions), the number of levels (choices at each position), and the number of experiments to run. The information content of the output matrix (typically consisting of 1s and 0s that represent residue choices at each position) depends directly on the number of experiments to run (the more the better).
3) Use the output matrix to construct a protein alignment that codes the 1s and 0s back to specific residue choices at each position.
4) Synthesize the genes encoding the proteins represented in the protein alignment.
5) Test the proteins encoded by the synthesized genes in relevant assay(s).
6) Build a model on the tested genes/proteins.
7) Follow the steps described before to identify positions of importance and to build a subsequent library with improved fitness.
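Operations 2 and 3 above can be sketched without a statistical package for the simplest case, a two-level half-fraction design in which the last factor is set to the product of the others. This is an illustrative assumption rather than the design any particular package would produce; the four toggle positions and their residue pairs are hypothetical. For four factors this construction yields the standard 8-run resolution IV half fraction.

```python
from itertools import product


def half_fraction(n_factors):
    """Two-level 2^(n-1) design in -1/+1 coding.

    The first n-1 factors form a full factorial; the last factor is the
    product of the others (the design generator).
    """
    runs = []
    for base in product((-1, 1), repeat=n_factors - 1):
        last = 1
        for level in base:
            last *= level
        runs.append(base + (last,))
    return runs


# Hypothetical toggle positions, each with two residue choices (bit 0, bit 1).
choices = {
    10: ("Ala", "Asp"),
    166: ("Ser", "Phe"),
    175: ("Gly", "Val"),
    340: ("Phe", "Ala"),
}
positions = sorted(choices)

design = half_fraction(len(positions))
# Output matrix of 1s and 0s (operation 2): map -1 -> 0 and +1 -> 1.
matrix = [[(level + 1) // 2 for level in run] for run in design]
# Protein alignment (operation 3): code the bits back to residue choices.
alignment = [
    {pos: choices[pos][bit] for pos, bit in zip(positions, row)}
    for row in matrix
]

for row, member in zip(matrix, alignment):
    print(row, member)
```

Eight sequences are produced instead of the sixteen of the full factorial, and every residue choice appears in exactly half of them.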

For example purposes, consider a protein in which the functionally best amino acid residues at 20 positions are to be determined, e.g., where there are 2 possible amino acids available at each position. In this case, a resolution IV factorial design would be appropriate. A resolution IV design is defined as one which is capable of elucidating the effects of all single variables, with no two-factor effects overlapping them. The design would then specify a set of 40 specific amino acid sequences that would cover the total diversity of 2²⁰ (~1 million) possible sequences. These sequences are then generated by a standard gene synthesis protocol and the function and fitness of these clones is determined.

An alternative to the above approaches is to employ all available sequences, e.g., the GenBank® database and other public sources, to provide the protein variant library. Although this entails massive computational power, current technologies make the approach feasible. Mapping all available sequences provides an indication of sequence space regions of interest.

B. Generating a Sequence Activity Model & Using that Model to Identify Residue Positions for Variation

As indicated above, a sequence-activity model used with the present invention relates protein sequence information to protein activity. The protein sequence information used by the model may take many forms. Frequently, it is a complete sequence of the amino acid residues in a protein; e.g., HGPVFSTGGA . . . . In some cases, however, it may be unnecessary to provide the complete amino acid sequence. For example, it may be sufficient to provide only those residues that are to be varied in a particular research effort. At later stages in research, for example, many residues may be fixed and only limited regions of sequence space remain to be explored. In such situations, it may be convenient to provide sequence activity models that require, as inputs, only the identification of those residues in the regions of the protein where the exploration continues. Still further, some models may not require exact identities of residues at the residue positions, but instead identify one or more physical or chemical properties that characterize the amino acid at a particular residue position. For example, the model may require specification of residue positions by bulk, hydrophobicity, acidity, etc. In some models, combinations of such properties are employed.

The form of the sequence-activity model can vary widely, so long as it provides a vehicle for correctly approximating the relative activity of proteins based on sequence information. Generally, it will treat activity as a dependent variable and sequence/residue values as independent variables. Examples of the mathematical/logical form of models include linear and non-linear mathematical expressions of various orders, neural networks, classification and regression trees/graphs, clustering approaches, recursive partitioning, support vector machines, and the like. In one preferred embodiment, the model form is a linear additive model in which the products of coefficients and residue values are summed. In another preferred embodiment, the model form is a non-linear product of various sequence/residue terms, including certain residue cross-products (which represent interaction terms between residues).

Models are developed from a training set of activity versus sequence information to provide the mathematical/logical relationship between activity and sequence. This relationship is typically validated prior to use for predicting activity of new sequences or residue importance.

Various techniques for generating models are available. Frequently, such techniques are optimization or minimization techniques. Specific examples include partial least squares, various other regression techniques, as well as genetic programming optimization techniques, neural network techniques, recursive partitioning, and support vector machine techniques. Generally, the technique should produce a model that can distinguish residues that have a significant impact on activity from those that do not. Preferably, the model should also rank individual residues or residue positions based on their impact on activity.

In one important class of techniques, models are generated by a regression technique that identifies covariation of independent and dependent variables in a training set. Various regression techniques are known and widely used. Examples include multiple linear regression (MLR), principal component regression (PCR), and partial least squares regression (PLS).

MLR is the most basic of these techniques. It simply solves a set of coefficient equations for members of a training set. Each equation relates the activity of a training set member (dependent variable) to the presence or absence of a particular residue at a particular position (independent variables). Depending upon the number of residue options in the training set, these expressions can be quite large.
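A minimal sketch of MLR follows, solving the least-squares normal equations in plain Python. It assumes a small hypothetical data set with three positions coded +1 or −1 (one variable per position) and activities generated from known coefficients, so that the fit can be checked; it also assumes the design matrix has full column rank, which a 1/0 coding with one variable per residue plus an intercept would not have.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_mlr(x_rows, y):
    """Ordinary least squares via the normal equations (X^T X) c = X^T y."""
    design = [[1.0] + list(row) for row in x_rows]  # leading 1 -> intercept c0
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    return solve_linear(xtx, xty)


# Hypothetical training set: full factorial over three +1/-1 coded positions.
x_rows = [
    (-1, -1, -1), (-1, -1, 1), (-1, 1, -1), (-1, 1, 1),
    (1, -1, -1), (1, -1, 1), (1, 1, -1), (1, 1, 1),
]
true_c = [5.0, 2.0, -1.0, 0.5]  # assumed c0 and per-position coefficients
y = [true_c[0] + sum(c * x for c, x in zip(true_c[1:], row)) for row in x_rows]

coeffs = fit_mlr(x_rows, y)
print([round(c, 6) for c in coeffs])
```

With noise-free data the fit recovers the coefficients used to generate the activities; with real assay data the same call would return least-squares estimates.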

Like MLR, PLS and PCR generate models from equations relating sequence activity to residue values. However, these techniques do so in a different manner. They first perform a coordinate transformation to reduce the number of independent variables. They then perform the regression on the transformed variables. In MLR, there are a potentially very large number of independent variables: two or more for each residue position that varies within the training set. Given that proteins and peptides of interest are often quite large and the training set may provide many different sequences, the number of independent variables can quickly become very large. By reducing the number of variables to focus on those that provide the most variation in the data set, PLS and PCR generally require fewer samples and simplify the problem of generating a model.

PCR is similar to PLS regression in that the actual regression is done on a relatively small number of latent variables obtained by coordinate transformation of the raw independent variables (residue values). The difference between PLS and PCR is that the latent variables in PCR are constructed by maximizing covariation between the independent variables (residue values). In PLS regression, the latent variables are constructed in such a way as to maximize the covariation between the independent variables and the dependent variables (activity values). Partial Least Squares regression is described in Hand, D. J., et al. (2001) Principles of Data Mining (Adaptive Computation and Machine Learning), Boston, Mass., MIT Press, and in Geladi, et al. (1986) “Partial Least-Squares Regression: a Tutorial,” Anal. Chim. Acta, 198:1-17. Both of these references are incorporated herein by reference for all purposes.

In PCR and PLS, the direct result of the regression is an expression for activity that is a function of the weighted latent variables. This expression can be transformed to an expression for activity as a function of the original independent variables by performing a coordinate transformation that converts the latent variables back to the original independent variables.
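The back-transformation can be written compactly under the assumption that the latent variables are linear combinations of the original variables. The symbols below are introduced here for illustration only: X is the matrix of residue values, P is the matrix whose columns define the latent variables (principal component loadings in PCR, or the corresponding weight matrix in PLS), T holds the latent variable scores, and b holds the regression coefficients fitted on T.

```latex
\begin{aligned}
T &= X P \\
\hat{y} &= c_{0} + T b \\
        &= c_{0} + X (P b) \\
\Rightarrow \quad c &= P b
\end{aligned}
```

Under these assumptions, the vector c gives one coefficient per original independent variable, which is the form needed for ranking individual residues.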

In essence, both PCR and PLS first reduce the dimensionality of the information contained in the training set and then perform a regression analysis on a transformed data set, which has been transformed to produce new independent variables but preserves the original dependent variable values. The transformed versions of the data sets may result in only a relatively few expressions for performing the regression analysis. Compare this with a situation where no dimension reduction is performed. In that situation, each separate residue for which there can be a variation must be considered. This can be a very large set of coefficients; 2^(N) coefficients, where N is the number of residue positions that may vary in the training set. In a typical principal component analysis, only 3, 4, 5, or 6 principal components are employed.

Another class of tools that can be used to generate models in accordance with this invention is support vector machines (SVMs). These mathematical tools take as inputs training sets of sequences that have been classified into two or more groups based on activity. Support vector machines operate by weighting different members of a training set differently depending upon how close they are to a hyperplane interface separating “active” and “inactive” members of the training set. This technique requires that the scientist first decide which training set members to place in the active group and which training set members to place in the inactive group. This may be accomplished by choosing an appropriate numerical value of activity to serve as the boundary between active and inactive members of the training set. From this classification, the support vector machine will generate a vector, W, that can provide coefficient values for individual ones of the independent variables defining the sequences of the active and inactive group members in the training set. These coefficients can be used to “rank” individual residues as described elsewhere herein. The technique attempts to identify a hyperplane that maximizes the distance between the closest training set members on opposite sides of that plane. In another variation, support vector regression modeling is carried out. In this case, the dependent variable is a vector of continuous activity values. The support vector regression model will generate a coefficient vector, W, which can be used to rank individual residues.

SVMs have been used to look at large data sets in many studies and have been quite popular in the DNA microarray field. Their potential strengths include the ability to finely discriminate (by weighting) which factors separate samples from each other. To the extent that an SVM can tease out precisely which residues contribute to function, it can be a particularly useful tool for ranking residues in accordance with this invention. SVMs are described in S. Gunn (1998) “Support Vector Machines for Classification and Regressions,” Technical Report, Faculty of Engineering and Applied Science, Department of Electronics and Computer Science, University of Southampton, which is incorporated herein by reference for all purposes.

Another model generation technique of interest is genetic programming. This technique employs a Darwinian style evolution to discover the formulae and rules that characterize the data of a training set. It can be used in regression problems of the types described herein. The underlying effect can be linear or non-linear. Genetic programming is described in R. Goodacre et al. (2000) “Detection of the Dipicolinic Acid Biomarker in Bacillus Spores Using Curie-Point Pyrolysis Mass Spectrometry and Fourier Transform Infrared Spectroscopy,” Anal. Chem., 72, 119-127, which is incorporated herein by reference for all purposes. Examples of software tools for performing genetic programming include the “GMAX” and the “GMAX-Bio” available from Aber Genomic Computing Ltd of Wales, UK.

In general, a regression model employed in the practice of the present invention has the following form:

$\begin{matrix}{y = {\sum\limits_{i = 1}^{N}\; {\sum\limits_{j = 1}^{M}\; {c_{ij}x_{ij}}}}} & (1)\end{matrix}$

In this expression, y is the predicted response, while c_(ij) and x_(ij) are, respectively, the regression coefficient and bit value (i.e., residue choice) for residue type j at position i in the sequence. There are N residue positions in the sequences of the protein variant library and each of these may be occupied by one or more residues. At any given position, there may be j=1 through M separate residue types. This model assumes a linear (additive) relationship between the residues at every position. An expanded version of equation 1 follows:

$y = c_{0} + c_{11}x_{11} + c_{12}x_{12} + \ldots + c_{1M}x_{1M} + c_{21}x_{21} + c_{22}x_{22} + \ldots + c_{2M}x_{2M} + \ldots + c_{NM}x_{NM}$

As indicated, data in the form of activity and sequence information is derived from the initial protein variant library and used to determine the regression coefficients of the model. The bit values are first identified from an alignment of the protein variant sequences. Variable residue positions, i.e., positions at which the amino acid residues differ between sequences, are identified from among the protein variant sequences. Amino acid residue information in some or all of these variable residue positions may be incorporated in the sequence activity model.

Table I contains sequence information in the form of variable residue positions and residue type for 10 illustrative variant proteins, along with activity values corresponding to each variant protein. It should be understood that these are representative members of a larger set, which is required to generate enough equations to solve for all the coefficients. Thus, for example, for the illustrative protein variant sequences in Table I, positions 10, 166, 175, and 340 are variable residue positions, and all other positions, i.e., those not indicated in the Table, contain residues that are identical between Variants 1-10.

TABLE I
Illustrative Sequence and Activity Data

Variable Positions:   10     166    175    340    y (activity)
Variant 1             Ala    Ser    Gly    Phe    y₁
Variant 2             Asp    Phe    Val    Ala    y₂
Variant 3             Lys    Leu    Gly    Ala    y₃
Variant 4             Asp    Ile    Val    Phe    y₄
Variant 5             Ala    Ile    Val    Ala    y₅
Variant 6             Asp    Ser    Gly    Phe    y₆
Variant 7             Lys    Phe    Gly    Phe    y₇
Variant 8             Ala    Phe    Val    Ala    y₈
Variant 9             Lys    Ser    Gly    Phe    y₉
Variant 10            Asp    Leu    Val    Ala    y₁₀
and so on.

Thus, based on equation 1, a regression model can be derived from the systematically varied library in Table I, i.e.:

$\begin{matrix}{y = {c_{0} + {c_{10{Ala}}x_{10{Ala}}} + {c_{10{Asp}}x_{10{Asp}}} + {c_{10{Lys}}x_{10{Lys}}} + {c_{166{Ser}}x_{166{Ser}}} + {c_{166{Phe}}x_{166{Phe}}} + {c_{166{Leu}}x_{166{Leu}}} + {c_{166{Ile}}x_{166{Ile}}} + {c_{175{Gly}}x_{175{Gly}}} + {c_{175{Val}}x_{175{Val}}} + {c_{340{Phe}}x_{340{Phe}}} + {c_{340{Ala}}x_{340{Ala}}}}} & (2)\end{matrix}$

The bit values (x variables) can be represented as either 1 or 0, reflecting the presence or absence of the designated amino acid residue, or alternatively as 1 or −1. For example, using the 1 or 0 designation, x_(10Ala) would be “1” for Variant 1 and “0” for Variant 2. Using the 1 or −1 designation, x_(10Ala) would be “1” for Variant 1 and “−1” for Variant 2. The regression coefficients can thus be derived from regression equations based on the sequence activity information for all variants in the library. Examples of such equations for Variants 1-10 (using the 1 or 0 designation for x) follow:

y ₁ =c ₀ +c _(10 Ala)(1)+c _(10Asp)(0)+c _(10 Lys)(0)+c _(166Ser)(1)+c_(166 Phe)(0)+c _(166Leu)(0)+c _(166Ile)(0)+c _(175Gly)(1)+c_(175 Val)(0)+c _(340 Phe)(1)+c _(340 Ala)(0)

y ₂ =c ₀ +c _(10 Ala)(0)+c _(10Asp)(1)+c _(10 Lys)(0)+c _(166Ser)(0)+c_(166 Phe)(1)+c _(166Leu)(0)+c _(166Ile)(0)+c _(175Gly)(0)+c_(175 Val)(1)+c _(340 Phe)(0)+c _(340 Ala)(1)

y ₃ =c ₀ +c _(10 Ala)(0)+c _(10Asp)(0)+c _(10 Lys)(1)+c _(166Ser)(0)+c_(166 Phe)(0)+c _(166Leu)(1)+c _(166Ile)(0)+c _(175Gly)(1)+c_(175 Val)(0)+c _(340 Phe)(0)+c _(340 Ala)(1)

y ₄ =c ₀ +c _(10 Ala)(0)+c _(10Asp)(1)+c _(10 Lys)(0)+c _(166Ser)(0)+c_(166 Phe)(0)+c _(166Leu)(0)+c _(166Ile)(1)+c _(175Gly)(0)+c_(175 Val)(1)+c _(340 Phe)(1)+c _(340 Ala)(0)

y ₅ =c ₀ +c _(10 Ala)(1)+c _(10Asp)(0)+c _(10 Lys)(0)+c _(166Ser)(0)+c_(166 Phe)(0)+c _(166Leu)(0)+c _(166Ile)(1)+c _(175Gly)(0)+c_(175 Val)(1)+c _(340 Phe)(0)+c _(340 Ala)(1)

y ₆ =c ₀ +c _(10 Ala)(0)+c _(10Asp)(1)+c _(10 Lys)(0)+c _(166Ser)(1)+c_(166 Phe)(0)+c _(166Leu)(0)+c _(166Ile)(0)+c _(175Gly)(1)+c_(175 Val)(0)+c _(340 Phe)(1)+c _(340 Ala)(0)

y ₇ =c ₀ +c _(10 Ala)(0)+c _(10Asp)(0)+c _(10 Lys)(1)+c _(166Ser)(0)+c_(166 Phe)(1)+c _(166Leu)(0)+c _(166Ile)(0)+c _(175Gly)(1)+c_(175 Val)(0)+c _(340 Phe)(1)+c _(340 Ala)(0)

y ₈ =c ₀ +c _(10 Ala)(1)+c _(10Asp)(0)+c _(10 Lys)(0)+c _(166Ser)(0)+c_(166 Phe)(1)+c _(166Leu)(0)+c _(166Ile)(0)+c _(175Gly)(0)+c_(175 Val)(1)+c _(340 Phe)(0)+c _(340 Ala)(1)

y ₉ =c ₀ +c _(10 Ala)(0)+c _(10Asp)(0)+c _(10 Lys)(1)+c _(166Ser)(1)+c_(166 Phe)(0)+c _(166Leu)(0)+c _(166Ile)(0)+c _(175Gly)(1)+c_(175 Val)(0)+c _(340 Phe)(1)+c _(340 Ala)(0)

y ₁₀ =c ₀ +c _(10 Ala)(0)+c _(10Asp)(1)+c _(10 Lys)(0)+c _(166Ser)(0)+c_(166 Phe)(0)+c _(166Leu)(1)+c _(166Ile)(0)+c _(175Gly)(0)+c_(175 Val)(1)+c _(340 Phe)(0)+c _(340 Ala)(1)
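The bit values in the equations above can be generated mechanically from Table I. The following sketch encodes each variant as a row of x values in the same term order as equation 2; the names `TABLE_I`, `TERMS`, and `bit_values` are illustrative, and the `absent` argument switches between the 1/0 and 1/−1 designations.

```python
# Residues at variable positions 10, 166, 175, 340 for Variants 1-10 (Table I).
POSITIONS = (10, 166, 175, 340)
TABLE_I = [
    ("Ala", "Ser", "Gly", "Phe"),
    ("Asp", "Phe", "Val", "Ala"),
    ("Lys", "Leu", "Gly", "Ala"),
    ("Asp", "Ile", "Val", "Phe"),
    ("Ala", "Ile", "Val", "Ala"),
    ("Asp", "Ser", "Gly", "Phe"),
    ("Lys", "Phe", "Gly", "Phe"),
    ("Ala", "Phe", "Val", "Ala"),
    ("Lys", "Ser", "Gly", "Phe"),
    ("Asp", "Leu", "Val", "Ala"),
]

# One term per (position, residue), in the order used in equation 2.
TERMS = [
    (10, "Ala"), (10, "Asp"), (10, "Lys"),
    (166, "Ser"), (166, "Phe"), (166, "Leu"), (166, "Ile"),
    (175, "Gly"), (175, "Val"),
    (340, "Phe"), (340, "Ala"),
]


def bit_values(variant, absent=0):
    """Return 1 where the variant carries the term's residue, else `absent`."""
    return [
        1 if variant[POSITIONS.index(pos)] == res else absent
        for pos, res in TERMS
    ]


rows = [bit_values(v) for v in TABLE_I]           # 1 or 0 designation
rows_pm = [bit_values(v, absent=-1) for v in TABLE_I]  # 1 or -1 designation

print(rows[0])  # x values for Variant 1, matching the equation for y1
```

Each row contains exactly four 1s, one per variable position, which is also why the 1/0 columns for a given position always sum to the intercept column.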

The complete set of equations can be readily solved using a regression technique (e.g., PCR, PLS, or MLR) to determine the value for regression coefficients corresponding to each residue and position of interest. In this example, the relative magnitude of the regression coefficient correlates to the relative magnitude of contribution of that particular residue at the particular position to activity. The regression coefficients may then be ranked or otherwise categorized to determine which residues are more likely to favorably contribute to the desired activity. Table II provides illustrative regression coefficient values corresponding to the systematically varied library exemplified in Table I:

TABLE II
Illustrative Rank Ordering of Regression Coefficients

REGRESSION COEFFICIENT    VALUE
c_(166Ile)                62.15
c_(175Gly)                61.89
c_(10Asp)                 60.23
c_(340Ala)                57.45
c_(10Ala)                 50.12
c_(166Phe)                49.65
c_(166Leu)                49.42
c_(340Phe)                47.16
c_(166Ser)                45.34
c_(175Val)                43.65
c_(10Lys)                 40.15

The rank ordered list of regression coefficients can be used to construct a new library of protein variants that is optimized with respect to a desired activity (i.e., improved fitness). This can be done in various ways. In one case, it is accomplished by retaining the amino acid residues having coefficients with the highest observed values. These are the residues indicated by the regression model to contribute the most to desired activity. If negative descriptors are employed to identify residues (e.g., 1 for leucine and −1 for glycine), it becomes necessary to rank residue positions based on absolute value of the coefficient. Note that in such situations, there is typically only a single coefficient for each residue. The absolute value of the coefficient magnitude gives the ranking of the corresponding residue position. Then, it becomes necessary to consider the signs of the individual residues to determine whether each of them is detrimental or beneficial in terms of the desired activity.

Residues are generally considered in the order in which they are ranked. For each residue under consideration, the process determines whether to “toggle” that residue. The term “toggling” refers to the introduction of multiple amino acid residue types into a specific position in the sequences of protein variants in the optimized library. For example, serine may appear in position 166 in one protein variant, whereas phenylalanine may appear in position 166 in another protein variant in the same library. Amino acid residues that did not vary between protein variant sequences in the training set typically remain fixed in the optimized library.

An optimized protein variant library can be designed such that all of the identified “high” ranking regression coefficient residues are fixed, and the remaining lower ranking regression coefficient residues are toggled. The rationale is that one should search the local space surrounding the ‘best’ predicted protein. Note that the starting point “backbone” in which the toggles are introduced may be the best protein predicted by a model or an already validated ‘best’ protein from a screened library.

In an alternative approach, at least one or more, but not all of thehigh-ranking regression coefficient residues identified may be fixed inthe optimized library, and the others toggled. This approach isrecommended if it is desired not to drastically change the context ofthe other amino acid residues by incorporating too many changes at onetime. Again, the starting point for toggling may be the best set ofresidues as predicted by the model or a best validated protein from anexisting library. Or the starting point may be an “average” clone thatmodels well. In this case, it may be desirable to toggle the residuespredicted to be of higher importance. The rationale for this being thatone should explore a larger space in search for activity hillspreviously omitted from the sampling. This type of library is typicallymore relevant in early rounds as it generates a more refined picture forsubsequent rounds.

Alternatives to the above methodology involve different procedures for using residue importance (rankings) in determining which residues to toggle. In one such alternative, higher ranked residue positions are favored for toggling. The information needed in this approach includes the sequence of a best protein from the training set, a PLS or PCR predicted best sequence, and a ranking of residues from the PLS or PCR model. The “best” protein is a wet-lab validated “best” clone in the dataset (clone with the highest measured function that still models well, i.e., falls relatively close to the predicted value in cross validation). The method compares each residue from this protein with the corresponding residue from a “best predicted” sequence having the highest value of the desired activity. This is accomplished using, e.g., the loads matrix (described below), starting with the residue having the highest load. Alternatively, another measure of the PLS or PCR best-predicted sequence such as highest value of regression coefficient for each position is used. If the residue with the highest load or regression coefficient is not present in the ‘best’ clone, the method introduces that position as a toggle position for the subsequent library. If the residue is present in the best clone, the method will not treat the position as a toggle position, and it will move to the next position in succession. The process is repeated for various residues, moving through successively lower load values, until a library of sufficient size is generated.
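The comparison described in the preceding paragraph can be sketched as follows. This is an illustrative Python example only: the function name, the dictionary representation of sequences (position mapped to residue), and the example sequences are assumptions, and the ranked list of positions is presumed to come from loads or regression coefficients computed elsewhere.

```python
def choose_toggle_positions(best_clone, best_predicted, ranked_positions, max_toggles):
    """Walk positions from highest to lowest rank and collect toggle positions.

    A position becomes a toggle position when the best-predicted residue
    is not the residue present in the wet-lab validated best clone.
    Collection stops once max_toggles positions have been found.
    """
    toggles = []
    for pos in ranked_positions:
        if len(toggles) >= max_toggles:
            break
        if best_clone[pos] != best_predicted[pos]:
            # Toggle between the validated residue and the predicted residue.
            toggles.append((pos, best_clone[pos], best_predicted[pos]))
    return toggles


# Hypothetical example sequences, restricted to the varied positions.
best_clone = {10: "Asp", 166: "Phe", 175: "Gly", 340: "Phe"}
best_predicted = {10: "Asp", 166: "Ile", 175: "Gly", 340: "Ala"}
ranked_positions = [166, 175, 10, 340]  # highest load or coefficient first

toggles = choose_toggle_positions(best_clone, best_predicted, ranked_positions, 2)
```

Here positions 175 and 10 are skipped because the best clone already carries the predicted residue, so the two toggle positions are 166 and 340.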

The number of regression coefficient residues to retain, and number of regression coefficient residues to toggle, can be varied. Factors to consider include the desired library size, the magnitude of difference between regression coefficients, and the degree to which nonlinearity is thought to exist; retaining residues with small (neutral) coefficients may uncover important nonlinearities in subsequent rounds of evolution. Typical optimized protein variant libraries of the present invention contain about 2^(N) protein variants, where N represents the number of positions that are toggled between two residues. Stated another way, the diversity added by each additional toggle doubles the size of the library such that 10 toggle positions produces ~1,000 clones (1,024), 13 positions ~10,000 clones (8,192) and 20 positions ~1,000,000 clones (1,048,576). The appropriate size of library depends on factors such as cost of screen, ruggedness of landscape, preferred percentage sampling of space, etc. In some cases, it has been found that a relatively large number of changed residues produces a library in which an inordinately large percentage of the clones are non-functional. Therefore for some applications, it may be recommended that the number of residues for toggling ranges from about 2 to about 13; i.e., the library size ranges from about 4 to about 10,000 clones.
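To make the 2^(N) relationship concrete, the following Python sketch enumerates every variant obtained by toggling N positions between two residues on a fixed backbone. The function name, the backbone, and the toggled residues are hypothetical and chosen only for illustration.

```python
from itertools import product


def enumerate_library(backbone, toggles):
    """Enumerate all variants of a backbone given toggled positions.

    backbone: dict mapping position -> residue (the fixed starting sequence).
    toggles:  dict mapping position -> tuple of residues allowed at that position.
    """
    positions = sorted(toggles)
    variants = []
    for choice in product(*(toggles[p] for p in positions)):
        variant = dict(backbone)           # copy so the backbone is not mutated
        variant.update(zip(positions, choice))
        variants.append(variant)
    return variants


# Hypothetical backbone with two positions toggled between two residues each.
backbone = {10: "Asp", 166: "Ile", 175: "Gly", 340: "Ala"}
toggles = {166: ("Ile", "Phe"), 340: ("Ala", "Phe")}
library = enumerate_library(backbone, toggles)  # 2**2 = 4 variants
```

The library sizes quoted above follow directly: 2**10 is 1,024, 2**13 is 8,192, and 2**20 is 1,048,576.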

In practice, one can pursue various subsequent round library strategies at the same time, with some strategies being more aggressive (fixing more “beneficial” residues) and other strategies being more conservative (fixing fewer “beneficial” residues in the hopes of exploring the space more thoroughly).

Optimized protein variant libraries can be generated using therecombination methods described herein, or alternatively, by genesynthesis methods, followed by in vivo or in vitro expression. Theoptimized protein variant libraries are then screened for desiredactivity, and sequenced. As indicated above in the discussion of FIG.1A, the activity and sequence information from the optimized proteinvariant library can be employed to generate another sequence activitymodel from which a further optimized library can be designed, using themethods described herein. In one approach, all proteins from this newlibrary are used as part of the dataset.

In varied approaches, a wet-lab validated ‘best’ (or one of the fewbest) protein in the current optimized library (i.e., a protein with thehighest, or one of the few highest, measured function that still modelswell, i.e., falls relatively close to the predicted value in crossvalidation) may serve as a backbone where various schemes of changes areincorporated. In another approach, a wet-lab validated ‘best’ (or one ofthe few best) protein in the current library that may not model well mayserve as a backbone where various schemes of changes are incorporated.In other approaches, a sequence predicted by the sequence activity modelto have the highest value (or one of the highest values) of the desiredactivity may serve as the backbone. In these approaches, the dataset forthe “next generation” library (and possibly a corresponding model) isobtained by changing residues in one or a few of the best proteins. Inone embodiment, these changes comprise a systematic variation of theresidues in the backbone. In some cases, the changes comprise variousmutagenesis, recombination and/or subsequence selection techniques. Eachof these may be performed in vitro, in vivo, or in silico.

Multiple other variations on the above approach are within the scope of this invention. As one example, the x_(ij) variables are representations of the physical or chemical properties of amino acids, rather than the exact identities of the amino acids themselves (leucine versus valine versus proline, . . . ). Examples of such properties include lipophilicity, bulk, and electronic properties (e.g., formal charge, van der Waals surface area associated with a partial charge, etc.). To implement this approach, the x_(ij) values representing amino acid residues can be presented in terms of their properties or principal components constructed from the properties.

In another variation, the x_(ij) variables represent nucleotides, ratherthan amino acid residues. The goal is to identify nucleic acid sequencesthat encode proteins for a protein variant library. By using nucleotidesrather than amino acids, one can optimize on parameters other thanmerely specific activity. For example, protein expression in aparticular host or vector may be a function of nucleotide sequence. Twodifferent nucleotide sequences may encode a protein having one aminoacid sequence, but one of the nucleotide sequences expresses greaterquantities of protein and/or expresses the protein in a more activestate. By using nucleotide sequences rather than amino acid sequences,the methods of this invention can optimize for expression properties,for example, as well as specific activity.

In a specific embodiment, the nucleotide sequence is represented as codons. Models may employ codons as the atomic unit of a nucleotide sequence such that the predicted activities are a function of various codons in the nucleotide sequence. Each codon together with its position in the overall nucleotide sequence serves as an independent variable for generating sequence activity models. Note that different codons for a given amino acid express differently in a given organism. More specifically, each organism has a preferred codon, or distribution of codon frequencies, for a given amino acid. By using codons as the independent variables, the invention accounts for these preferences.
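One possible way to turn codons and their positions into independent variables is sketched below in Python. The function names and the 0/1 indicator encoding are assumptions for illustration; the two example sequences encode the same dipeptide (Leu-Gly) with different leucine codons.

```python
def codon_features(nucleotide_sequence):
    """Split a coding sequence into (codon position, codon) pairs, 1-indexed."""
    if len(nucleotide_sequence) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return [
        (i // 3 + 1, nucleotide_sequence[i:i + 3])
        for i in range(0, len(nucleotide_sequence), 3)
    ]


def indicator_vector(nucleotide_sequence, feature_index):
    """Return 1 for each (position, codon) feature present in the sequence, else 0."""
    present = set(codon_features(nucleotide_sequence))
    return [1 if feature in present else 0 for feature in feature_index]


# Two hypothetical sequences encoding Leu-Gly with different leucine codons.
seq_a = "CTGGGC"
seq_b = "TTAGGC"

# One independent variable per (position, codon) pair observed in the set.
feature_index = sorted(set(codon_features(seq_a)) | set(codon_features(seq_b)))
x_a = indicator_vector(seq_a, feature_index)
x_b = indicator_vector(seq_b, feature_index)
```

Although both sequences encode the same amino acid sequence, their variable vectors differ at the first codon position, which is what allows a model to capture expression differences between synonymous codons.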

An outline of a particular method includes the following operations: (a) receiving data characterizing a training set of a protein variant library; (b) from the data, developing a sequence activity model that predicts activity as a function of nucleotide types and corresponding position in the nucleotide sequence; (c) using the sequence activity model to rank positions in a nucleotide sequence and/or nucleotide types at specific positions in the nucleotide sequence in order of impact on the desired activity; and (d) using the ranking to identify one or more nucleotides, in the nucleotide sequence, that are to be varied or fixed in order to impact the desired activity. As indicated, the nucleotides to be varied are preferably codons encoding particular amino acids.

Other variations of the above approach involve use of different techniques for ranking residues or otherwise characterizing them in terms of importance. In the above approach, the magnitudes of regression coefficients were used to rank residues. Residues having coefficients with large magnitudes (e.g., 166 Ile) were viewed as high-ranking residues. This characterization was used to decide whether or not to vary a particular residue in the generation of a new, optimized library of protein variants.

PLS and other techniques provide other information, beyond regression coefficient magnitude, that can be used to rank specific residues or residue positions. Techniques such as PLS and Principal Component Analysis (PCA) or PCR provide information in the form of principal components or latent vectors. These represent directions or vectors of maximum variation through multi-dimensional data sets such as the protein sequence-activity space employed in this invention. These latent vectors are functions of the various sequence dimensions; i.e., the individual residues or residue positions that comprise the protein sequences of the variant library used to construct the training set. A latent vector will therefore comprise a sum of contributions from each of the residue positions in the training set. Some positions will contribute more strongly to the direction of the vector. These will be manifest by relatively large “loads,” i.e., the coefficients used to describe the vector. As a simple example, a training set may be comprised of tripeptides. The first latent vector will typically have contributions from all three residues.

Vector 1 = a1(residue position 1) + a2(residue position 2) + a3(residue position 3)

The coefficients, a1, a2, and a3, are the loads. Because these reflect the importance of the corresponding residue positions to variation in the dataset, they can be used to rank the importance of individual residue positions for purposes of “toggling” decisions, as described above. Loads, like regression coefficients, may be used to rank residues at each toggled position. Various parameters describe the importance of these loads. Some, such as Variable Importance in Projection (VIP), make use of a load matrix, which is comprised of the loads for multiple latent vectors taken from a training set. In Variable Importance for PLS Projection, the importance of the kth variable (e.g., residue position) is computed by calculating VIP (variable importance in projection). For a given PLS dimension, a, (VIN)_(ak)² is equal to the squared PLS weight (w_(ak))² of a variable multiplied by the percent explained variability in y (dependent variable, e.g., certain function) by that PLS dimension. (VIN)_(ak)² is summed over all PLS dimensions (components). VIP is then calculated by dividing the sum by the total percent variability in y explained by the PLS model and multiplying by the number of variables in the model. Variables with large VIP, larger than 1, are the most relevant for correlating with a certain function (y) and hence highest ranked for purposes of making toggling decisions.
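The VIP calculation described above can be sketched in a few lines of Python. This is a hedged illustration, not a definitive implementation: the function name and the example weights are hypothetical, the PLS weights are assumed to be normalized to unit length within each dimension, and the final square root is a commonly used convention that is assumed here (the text above describes the quantity before the square root is taken).

```python
import math


def vip_scores(weights, explained_y):
    """Compute a VIP score for each variable.

    weights[a][k]:   PLS weight w_ak of variable k in dimension a
                     (each row assumed normalized to unit length).
    explained_y[a]:  percent of variability in y explained by dimension a.
    """
    n_vars = len(weights[0])
    total_explained = sum(explained_y)
    scores = []
    for k in range(n_vars):
        # Sum over PLS dimensions of (w_ak)^2 times explained variability.
        vin_sum = sum(w[k] ** 2 * ev for w, ev in zip(weights, explained_y))
        # Divide by total explained variability, multiply by variable count;
        # the square root is an assumed convention.
        scores.append(math.sqrt(n_vars * vin_sum / total_explained))
    return scores


# Hypothetical two-dimension PLS model over three variables.
weights = [[0.8, 0.6, 0.0],
           [0.0, 0.6, 0.8]]
explained_y = [60.0, 20.0]
scores = vip_scores(weights, explained_y)
```

Under the unit-length assumption the squared scores sum to the number of variables, so their mean square is 1, which is why a threshold of 1 is a natural cutoff for "large" VIP.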

Another embodiment of the invention employs techniques that rank residues not simply by the magnitudes of their predicted contributions to activity, but by the confidence in those predicted contributions as well. In the methods described to this point, residues or nucleotides (including codons) are ranked based solely on the magnitude of the coefficients or principal components identified during model building. In many cases, this works well. But in some cases the researcher will be concerned with spurious values of the coefficients or principal components.

In a more statistically rigorous approach, the ranking is based on a combination of magnitude and distribution. Coefficients with both high magnitudes and tight distributions give the highest ranking. In some cases, one coefficient with a lower magnitude than another may be given a higher ranking by virtue of having less variation. Thus, some embodiments of the invention rank residues or nucleotides based on both magnitude and standard deviation or variance. Various techniques can be used to accomplish this. One of these, a bootstrap p-value approach, will now be described.

An example of a method that employs a bootstrap method is depicted in FIG. 1B. As shown there, a method 125 begins at a block 127 where an original data set S is provided. This may be a training set as described above. For example, it may be generated by systematically varying the individual residues of a starting sequence in any one of the manners described above. In the example of method 125, the data set S has M different data points (activity and sequence information collected from amino acid or nucleotide sequences) for use in the analysis.

From data set S, various bootstrap sets B are created. Each of these is obtained by sampling, with replacement, from set S to create a new set of M members, all taken from original set S. See block 129. The “with replacement” condition produces variations on the original set S. The new bootstrap set, B, will sometimes contain replicate samples from S. And, it may also lack certain samples originally contained in S.

As an example, consider a set S of 100 sequences. Each bootstrap set B used in the method itself contains 100 sequences. A bootstrap set B is created by randomly selecting each of the 100 member sequences from the 100 sequences in the original set S. Thus, it is possible that some sequences will be selected more than once and others will not be selected at all.
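The sampling step of this example can be sketched in Python as follows. The function name and the placeholder sequence labels are hypothetical; a fixed random seed is used only to make the illustration repeatable.

```python
import random


def bootstrap_set(original, rng):
    """Draw len(original) members from original, with replacement."""
    return rng.choices(original, k=len(original))


rng = random.Random(0)                      # seeded for repeatability
S = [f"seq{i:03d}" for i in range(100)]     # placeholder labels for 100 sequences
B = bootstrap_set(S, rng)

n_distinct = len(set(B))                    # members of S that appear in B
n_missing = len(set(S) - set(B))            # members of S absent from B
```

Because sampling is with replacement, B has the same size as S but usually contains duplicates, so some members of S are typically absent from any given bootstrap set.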

Using the bootstrap set B currently under consideration, the method next builds a model. See block 131. The model may be built as described above, using PLS, PCR, a SVM, genetic programming, etc. This model will provide coefficients or other indicia of ranking for the residues or nucleotides found in the various samples from set B. As shown at a block 133, these coefficients or other indicia are recorded for subsequent use.

Next, at a decision block 135, the method determines whether another bootstrap set should be created. If yes, the method returns to block 129 where a new bootstrap set B is created as described above. If no, the method proceeds to a block 137 discussed below. The decision at block 135 turns on how many different sets of coefficient values are to be used in assessing the distributions of those values. The number of sets B should be sufficient to generate accurate statistics. As an example, 100 to 1000 bootstrap sets are prepared and analyzed. This is represented as about 100 to 1000 passes through blocks 129, 131, and 133 of method 125.

After a sufficient number of bootstrap sets B have been prepared and analyzed as described, decision 135 is answered in the negative. As indicated, the method then proceeds to block 137. There a mean and standard deviation of a coefficient (or other indicator generated by the model) is calculated for each residue or nucleotide (including codons) using the coefficient values (e.g., 100 to 1000 of them, one from each bootstrap set). From this information, the method can calculate the t-statistic and determine the confidence interval that the measured value is different from zero. From the t-statistic it calculates the p-value for the confidence interval. In this case, the smaller the p-value, the more confidence that the measured regression coefficient is different from zero.

Note that the p-value is but one of many different types of characterization that can account for the statistical variation in a coefficient or other indicator of residue importance. Examples include calculating 95 percent confidence intervals for regression coefficients and excluding from consideration any regression coefficient whose 95 percent confidence interval crosses the zero line. Basically, any characterization that accounts for standard deviation, variance, or other statistically relevant measure of data distribution can be used. Such characterization preferably also accounts for the magnitude of the coefficients.

A large standard deviation can result from various sources. One source is poor measurements in the data set. Another is a limited representation of a particular residue or nucleotide in the original data set. In this latter case, some bootstrap sets will contain no occurrences of a particular residue or nucleotide. In such cases, the value of the coefficient for that residue will be zero. Other bootstrap sets will contain at least some occurrences of the residue or nucleotide and give a non-zero value of the corresponding coefficient. But the sets giving a zero value will cause the standard deviation of the coefficient to become relatively large. This reduces the confidence in the coefficient value and results in a lower rank. But this is to be expected, given that there is relatively little data on the residue or nucleotide in question.

Next, at a block 139, the method ranks the regression coefficients (or other indicators) from lowest (best) p-value to highest (worst) p-value. This ranking correlates highly with the absolute value of the regression coefficients themselves, owing to the fact that the larger the absolute value, the more standard deviations removed from zero. Thus, for a given standard deviation, the p-value becomes smaller as the regression coefficient becomes larger. However, the absolute ranking will not always be the same with both p-value and pure magnitude methods, especially when relatively few data points are available to begin with in set S.
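The statistics of blocks 137 and 139 can be sketched as follows in Python. Several choices here are assumptions made for illustration: the function name, the recorded coefficient values, the use of mean divided by bootstrap standard deviation as the t-statistic, and a two-sided normal approximation for the p-value (which is reasonable only when many bootstrap sets are used; the five values per residue below are kept small purely for readability).

```python
import math
import statistics


def rank_by_p_value(recorded):
    """Rank residues from lowest (best) to highest (worst) p-value.

    recorded: dict mapping a residue label to the list of coefficient
              values recorded for it, one value per bootstrap set.
    Returns a list of (residue, mean, standard deviation, p-value) tuples.
    """
    rows = []
    for residue, values in recorded.items():
        mean = statistics.mean(values)
        sd = statistics.stdev(values)
        if sd == 0:
            t = 0.0 if mean == 0 else math.inf
        else:
            t = mean / sd
        # Two-sided p-value under a normal approximation (an assumption).
        p = math.erfc(abs(t) / math.sqrt(2))
        rows.append((residue, mean, sd, p))
    return sorted(rows, key=lambda row: row[3])


# Hypothetical recorded coefficients from five bootstrap sets.
recorded = {
    "166Ile": [60, 62, 64, 61, 63],    # large and tightly distributed
    "10Lys": [120, 0, 110, 0, 130],    # larger mean, but absent from two sets
    "175Val": [5, -4, 3, -6, 2],       # scattered around zero
}
ranked = rank_by_p_value(recorded)
order = [row[0] for row in ranked]
```

In this example 10Lys has the largest mean coefficient but ranks below 166Ile, because the zero values from bootstrap sets lacking that residue inflate its standard deviation.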

Finally, as shown at a block 141, the method fixes and toggles certain residues based on the rankings observed in the operation of block 139. This is essentially the same use of rankings described above for other embodiments. In one approach, the method fixes the best residues (now those with the lowest p-values) and toggles the others (those with highest p-values).

This method 125 has been shown in silico to perform well. Moreover, the p-value ranking approach naturally deals with single or few instance residues: the p-values will generally be higher (worse) because in the bootstrap process, those residues that did not appear often in the original data set will be less likely to get picked up at random. Even if their coefficients are large, their variability (measured in standard deviations) will be quite high as well. Intuitively, this is the desired result, since those residues that are not well represented (either have not been seen with sufficient frequency or have lower regression coefficients) may be good candidates for toggling in the next round of library design.

III. IDENTIFICATION OF TARGET BIO-MOLECULES WITH DESIRED PROPERTIES AND/OR FOR ARTIFICIAL EVOLUTION

A. Library Design Using Pareto Front Optimization for Multiple Properties

The present invention provides methods that utilize Pareto front optimization to select clones for carrying out future rounds of artificial evolution (e.g., DNA shuffling, etc.) in connection with the optimization of multiple polypeptide properties (i.e., multiple objectives). Pareto front optimization is a multi-objective evolutionary algorithm that simultaneously improves two or more desired objectives.

To illustrate, FIG. 2 provides a graph that illustrates a Pareto front in a plot of a hypothetical set of data, where function 2 (F2) is plotted as a function of function 1 (F1). Any optimization problem is optionally cast as a minimization problem, by, e.g., reversing the sign of the fitness or inverting the fitness. As shown in FIG. 2, for example, the axes represent different objectives to be simultaneously minimized. The solutions (represented by the numbered data points) that lie on the Pareto front represent trade-off solutions that are not “dominated” by any other solution. These non-dominated points are defined by the fact that no other solution in the hypothetical data set is better (smaller in this case) in both objectives. For example, solution 1 is part of the Pareto front because, even though solution 2 has a smaller value for objective F2, solution 1 has a smaller value for objective F1. In contrast, solution 7 is not part of the Pareto front because at least one solution is better in both objectives.
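A minimal Python sketch of identifying non-dominated points in a two-objective minimization problem follows. The data points are hypothetical and are not the points of FIG. 2. The dominance test used here is the conventional one (no worse in every objective and strictly better in at least one), which is an assumption that is slightly broader than "better in both objectives."

```python
def dominates(a, b):
    """Return True if solution a dominates solution b (minimization)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points):
    """Return the points that are not dominated by any other point."""
    return [
        p for p in points
        if not any(dominates(q, p) for q in points if q != p)
    ]


# Hypothetical (F1, F2) values, both objectives to be minimized.
points = [(1, 9), (2, 6), (4, 4), (7, 2), (9, 1), (6, 7), (8, 8)]
front = pareto_front(points)
```

Here (6, 7) and (8, 8) are excluded because (4, 4) is smaller in both objectives, while the remaining five points each trade one objective against the other.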

FIG. 4 is a chart that depicts certain steps performed in one embodiment of a method of the invention for identifying members of a population of biopolymer sequence variants most suitable for artificial evolution. The phrase “most suitable for artificial evolution” refers to those members of the variant population that lie at least proximal to a Pareto front, e.g., when the variants are scored (e.g., screened or selected) and plotted for desired objectives. These variants are generally the most suitable for artificial evolution, because they are not dominated by other variants (or at least most other variants) in at least one of the desired objectives.

As shown in A1 of FIG. 4, the method includes selecting or screening the members of the population of biopolymer sequence variants (e.g., character string variants, etc.) for two or more desired objectives to produce a multi-objective fitness data set. Desired objectives typically include, e.g., structural and/or functional properties, such as any of those described herein. The population of biopolymer sequence variants can be produced in accordance with the diversity generating procedures described herein, then screened for activities or other function (i.e., objectives). Thereafter, the method includes identifying a Pareto front (e.g., substantially convex, substantially non-convex, etc.) in the multi-objective fitness data set (A2), and selecting members proximal to the Pareto front (A3), thereby identifying the members of the population of biopolymer sequence variants most suitable for artificial evolution. In the context of the present invention, the “Pareto front” refers to biopolymer sequence variants that are non-dominated by other biopolymer sequence variants in at least one of two or more desired objectives. In some embodiments, the method further includes evolving the members selected in A3 using artificial evolution procedures to produce evolved biopolymer sequence variants. Various artificial evolution procedures that are optionally used to evolve these variants are described herein. At least one step, and in certain cases all steps, of these artificial evolution procedures may be performed in silico. These embodiments optionally also include repeating steps A1-A3 using the evolved biopolymer sequence variants as at least some of the members of the population of biopolymer sequence variants in a repeated step A1. Typically, at least one step, and in some cases all steps, of the methods described herein are performed in a digital or web-based system. Digital and web-based systems are described in greater detail below.

In addition, to provide an optimal set of solutions from which toselect, algorithms should generally attempt to evenly distribute ormaximally spread the solutions in objective space along the Paretofront, because clustered solutions typically lack sufficient diversity.Accordingly, algorithms are typically designed to order individualsolutions in a population based upon both fitness along each objectiveand according to their relative isolation in objective space. Thisapproach generally results in a good spread of solutions along thePareto front, even into non-convex regions of objective space.Non-convex Pareto fronts are discussed further below. One approach toselecting solutions based on their relative diversity is the techniqueof region-based selection, which is described further in, e.g., Corne etal., “PESA-II: Region-based selection in evolutionary multiobjectiveoptimization,” in Proceedings of the Genetic and EvolutionaryComputation Conference (GECCO-2001), Morgan Kaufmann Publishers, (2001),pp. 283-290. Region-based selection generally involves partitioning theobjective space into hyperboxes and preferentially selecting solutionsfrom less populated hyperboxes. Other techniques for selecting solutions(e.g., binary tournament selection, etc.), which are generally known inthe art are optionally utilized in practicing the methods describedherein.
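The hyperbox idea can be illustrated with the following simplified Python sketch. It is not an implementation of PESA-II; it only shows the partitioning of objective space into equal-width hyperboxes and a preference for members of less populated boxes. The function names, the box width, and the example points are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def hyperbox_index(point, lower, width):
    """Map a point in objective space to the index of its hyperbox."""
    return tuple(int((x - lo) // width) for x, lo in zip(point, lower))


def select_from_sparse_boxes(points, width, n_select, rng):
    """Pick one member from each of the n_select least populated hyperboxes."""
    n_objectives = len(points[0])
    lower = [min(p[i] for p in points) for i in range(n_objectives)]
    boxes = defaultdict(list)
    for p in points:
        boxes[hyperbox_index(p, lower, width)].append(p)
    # Visit boxes from least to most populated.
    ordered = sorted(boxes.values(), key=len)
    return [rng.choice(members) for members in ordered[:n_select]]


# Hypothetical front: three clustered solutions and two isolated ones.
front = [(1.0, 9.8), (1.2, 9.4), (1.4, 9.0), (5.0, 4.0), (9.0, 1.0)]
selected = select_from_sparse_boxes(front, 2.0, 2, random.Random(0))
```

The three clustered solutions share one hyperbox, so the two isolated solutions are selected first, which spreads the selection along the front.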

One significant advantage of Pareto front optimization is that the approach does not reduce the problem at issue to one of single objective optimization (e.g., by a weighted sum approach or the like); rather, the approach provides a set of optimal solutions from which to select. Although weighted measures are optionally used to select final solutions, not all solutions will be identified via this approach, e.g., if the Pareto front is non-convex. Accordingly, a simple weighted sum of objectives may restrict the ability of an algorithm to find viable solutions in these instances. The problem posed by non-convexity in the objective space is further illustrated in FIG. 3, which provides a graph that shows a plot of a hypothetical set of data. As shown and consistent with the definition, the set of solutions (represented by numbered data points) along the Pareto front are non-dominated. However, classical weight-based optimization, which is generally known in the art, would not yield solutions 3 and 4 for any weights on objectives F1 and F2, due to the existence of superior solutions based on the weighted sum. Furthermore, if an approximately equal trade-off for both objectives were sought, a whole class of solutions would be excluded using the classical methods.
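The limitation of weighted sums on a non-convex front can be demonstrated with a small Python sketch. The three points are hypothetical and are not the data of FIG. 3; the function name and the grid of weights are likewise assumptions for illustration. All three points are mutually non-dominated, yet the middle one lies in a non-convex region.

```python
def weighted_sum_winner(points, w1):
    """Return the point minimizing w1*F1 + (1 - w1)*F2."""
    w2 = 1.0 - w1
    return min(points, key=lambda p: w1 * p[0] + w2 * p[1])


# Hypothetical (F1, F2) values to be minimized; none dominates another.
points = [(0.0, 10.0), (6.0, 6.0), (10.0, 0.0)]

# Sweep the weight on F1 from 0 to 1 and collect every winning point.
winners = {weighted_sum_winner(points, i / 100) for i in range(101)}
```

The weighted sum for (6, 6) is always about 6, whereas one of the two extreme points always scores 5 or less, so the balanced trade-off solution is never returned for any weight even though it is on the Pareto front.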

Methods of the present invention include various embodiments for selecting sequence variants that are proximal to the Pareto front. For example, the methods optionally include applying one or more niching techniques to identify the members of the population of biopolymer sequence variants most suitable for artificial evolution. Additional details relating to various niching techniques are provided in, e.g., Darwen et al. (1997) “Speciation as automatic categorical modularization,” IEEE Transactions on Evolutionary Computation, 1(2):101-108, Darwen et al. (1996), “Every niching method has its niche: fitness sharing and implicit sharing compared,” Proc. of Parallel Problem Solving from Nature (PPSN) IV, Vol. 1141, Lecture Notes in Computer Science, Springer-Verlag, (1996), pp. 398-407, and Horn et al. (1994) “A niched Pareto genetic algorithm for multiobjective optimization,” In Proceedings of the First IEEE Conference on Evolutionary Computation, IEEE World Congress on Computational Intelligence, (1):82-87. In other embodiments, sequence variants are selected by, e.g., calculating a weighted sum of the two or more desired objectives for at least some of the members proximal to the Pareto front, and selecting at least one member that includes a higher weighted sum than other members proximal to the Pareto front. In still other embodiments, biopolymer sequence variants are selected by, e.g., ranking the one or more members according to relative proximity to the Pareto front and relative isolation in sequence space, and selecting at least one member that ranks higher than other members proximal to the Pareto front. Region-based selection techniques (described above) are also optionally used to select members proximal to the Pareto front. To illustrate, one region-based selection technique includes partitioning sequence space that includes the population of biopolymer sequence variants into one or more hyperboxes and selecting the members proximal to the Pareto front from at least one of the hyperboxes that is less populated than other regions of the sequence space.

To further illustrate, FIG. 5 is a chart that depicts certain stepsperformed in one embodiment of a method of identifying members of a setof biopolymer character string variants that include multiple improvedobjectives relative to other members of the set of biopolymer characterstring variants. As shown, the method includes applying one or moremulti-objective evolutionary algorithms to at least one parentalbiopolymer character string (e.g., a plurality of parental biopolymercharacter strings or the like) to produce the set of biopolymercharacter string variants (B1), and selecting or screening the membersof the set of biopolymer character string variants for two or moredesired objectives (B2). As further shown, the method also includesplotting the set of biopolymer character string variants as a functionof the two or more desired objectives to produce a biopolymer characterstring variant plot (e.g., as depicted in FIG. 2 or 3) (B3), andidentifying a Pareto front (e.g., substantially convex, substantiallynon-convex, etc.) in the biopolymer character string variant plot (B4),thereby identifying the members of the set of biopolymer characterstring variants that include the multiple improved objectives relativeto the other members of the set of biopolymer character string variants.The method is optionally iteratively performed, e.g., repeating stepsB1-B4 using at least one member of the set of biopolymer characterstring variants as a parental biopolymer character string in a repeatedstep B1. In some embodiments, the methods further include synthesizingpolynucleotide or polypeptide sequence variants that correspond tomembers of the set of biopolymer character string variants identified instep B4.

In preferred embodiments, members proximal to the Pareto front in agiven analysis are maximally spread apart (e.g., substantially evenly oruniformly distributed) from one another, e.g., to enhance diversityamong identified solutions, as described above. In other embodiments,the sequence variants proximal to the Pareto front are substantiallyunevenly distributed (e.g., randomly or non-uniformly distributed). Inaddition, the biopolymer character string variant plots are optionallypresented as, e.g., maximization or minimization plots.

Many different desired objectives are optionally screened or selectedaccording to these methods. To illustrate, each of the two or moredesired objectives typically independently include a physicochemical orfunctional property. In some embodiments, the two or more desiredobjectives include, e.g., constraints, values detailing distance fromachieving constraints, a total number of constraints satisfied, and/or arelative number of constraints satisfied. Optionally, the two or moredesired objectives include measures of fitness, competing ornon-competing objectives, or the like. Furthermore, the two or moredesired objectives are also optionally orthogonal to one another.

In other aspects, the invention provides systems for identifying members of a set of biopolymer character string variants that include multiple improved objectives relative to other members of the set of biopolymer character string variants. The systems include a computer having a database capable of storing the set of biopolymer character string variants. The systems also include system software that includes logic instructions for applying multi-objective evolutionary algorithms to parental biopolymer character strings to produce the set of biopolymer character string variants, and selecting or screening the members of the set of biopolymer character string variants for two or more desired objectives. The system software also includes logic instructions for plotting the set of biopolymer character string variants as a function of the two or more desired objectives to produce a biopolymer character string variant plot, and identifying a Pareto front in the biopolymer character string variant plot. Systems are described in greater detail below.

The invention also provides a computer program product that includes a computer readable medium having logic instructions for applying multi-objective evolutionary algorithms to parental biopolymer character strings to produce a set of biopolymer character string variants, and selecting or screening the members of the set of biopolymer character string variants for two or more desired objectives. In addition, the computer program product includes logic instructions for plotting the set of biopolymer character string variants as a function of the two or more desired objectives to produce a biopolymer character string variant plot, and identifying a Pareto front in the biopolymer character string variant plot to identify the members of the set of biopolymer character string variants that include multiple improved objectives relative to other members of the set of biopolymer character string variants.

To assist in selecting clones from a given experiment to further develop, e.g., via the artificial evolution procedures described herein, systems and computer program products of the invention generally include logic instructions that rank clones in terms of, e.g., their proximity to the Pareto front, by their relative isolation, and/or the like. This provides for extensive diversity along the Pareto front with the concomitant benefits of such diversity, as described above. Further, the best clones along the most advanced Pareto front are optionally selected at sampling rates (e.g., DNA concentrations, etc.) based on their modified fitness values. This allows clones from less populated areas of objective space to be sampled more often, which again promotes diversity in subsequent rounds of artificial evolution. A weighted sum of the activities after evolution is optionally used to select the “best” clone. However, researchers have found that using a weighted sum of the activities during evolution results in a single objective optimization with low diversity along the Pareto front.
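The ranking described above can be illustrated with a minimal sketch. It assumes two objectives that are both to be maximized, treats the Pareto front as the set of non-dominated clones, and measures relative isolation as the Euclidean distance to the nearest other front member. The clone names, scores, and function names are hypothetical and are used only for illustration; other isolation measures could be substituted.

```python
from math import dist

def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly
    better on at least one (all objectives are maximized in this sketch)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(scores):
    """Return the names of clones that no other clone dominates."""
    return [name for name, s in scores.items()
            if not any(dominates(t, s) for other, t in scores.items() if other != name)]

def rank_by_isolation(scores, front):
    """Order front members by distance to their nearest front neighbor,
    most isolated first, so sparsely populated regions are favored."""
    def nearest(name):
        gaps = [dist(scores[name], scores[m]) for m in front if m != name]
        return min(gaps) if gaps else float("inf")
    return sorted(front, key=nearest, reverse=True)

# Hypothetical clones scored on two objectives (e.g., activity, stability).
clones = {
    "c1": (0.9, 0.2),
    "c2": (0.8, 0.3),
    "c3": (0.5, 0.5),
    "c4": (0.2, 0.9),
    "c5": (0.4, 0.4),  # dominated by c3
    "c6": (0.1, 0.1),  # dominated by every front member
}
front = pareto_front(clones)
ranked = rank_by_isolation(clones, front)
```

In this toy data set, the two clones that sit close together on the front (c1 and c2) are ranked last, which reflects the goal of sampling less populated areas of objective space more often.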

In addition, niching techniques (mentioned above) are optionally applied to select clones for development. For example, in multi-modal single-objective optimization, research has shown that niching can be beneficial under certain circumstances. The idea is simply to artificially evolve those individuals in the population that are similar genotypically and which occupy high fitness areas. The reasoning is that motifs brought together from different modes in fitness space may not lead to better function. Indeed, they often lead to noise and disruption. In the context of multi-objective optimizations, a simplified toy problem may be simulated (e.g., using Kauffman's NK model, etc.) to determine whether niching assists or hinders evolution along the Pareto front. See, e.g., Kauffman, The Origins of Order, Oxford University Press (1993) and Kauffman and Johnsen, “Co-Evolution to the Edge of Chaos: Coupled Fitness Landscapes, Poised States, and Co-Evolutionary Avalanches,” in Langton et al., Artificial Life II: Proceedings of the Second Artificial Life Workshop, Addison-Wesley (1992), pp. 325-369. In particular, the outcome may depend on the relative ruggedness of each objective's fitness space. For example, motifs that confer, e.g., thermostability may be additive, while motifs that confer, e.g., activity under different pH conditions may be competitive, and attempts to make large jumps in multi-objective fitness space may lead to high dead rates.

B. In Silico Evolution

The present invention includes methods of optimizing library construction via in silico evolution of libraries using evolutionary search algorithms, including genetic algorithms and Monte Carlo methods, which are described herein. These methods maximize the successful in vivo and/or in vitro evolution of essentially any genetic material, including genes, operons, pathways, promoters, regulatory elements, genomes, or the like.

More specifically, FIG. 6 provides a chart depicting certain steps performed in a method embodiment for evolving libraries for directed evolution in which the library (L) is the unit of evolution in the algorithm. Each library is described by parameters such as sequence diversity, recombination method, experimental conditions, and/or the like. Additional parameters are described herein. The parameters are typically changed or otherwise evolve during the evolution process. As shown in C2, the methods include providing a population of libraries (e.g., an initial population of libraries (C1)), such as populations of biopolymer character string variants. The algorithm includes a set of operators (O) that operates on the unit L to produce a new population of libraries (C3). For example, the operations include adding and deleting diversity, changing recombination rates and frequencies, and/or the like. Additional details regarding operators that are optionally used in these methods are provided herein. In particular, the operator acts on a population of libraries to create the next generation of the population. As shown in C4, this next generation is then selected for fitness (F) to produce a fitter population of libraries (C5) and this process is iterated (C6). This evolutionary algorithm is typically stopped when desired characteristics (e.g., levels of fitness) for the libraries are met. Optionally, the selection process involves designing oligonucleotides using algorithms for facilitating the identification of data sequences corresponding to biological polymers and enumerating/simulating the outcome of an experiment followed by in silico estimation of the activities of the clones. Each library is then typically characterized by a fitness function that involves determining, e.g., mean activity of the clones, standard deviation of the activities of the clones, genetic diversity among clones, experimental simplicity of the library, etc. The activities of the clones can also be characterized by neural networks, PCA or other prediction tools or by structural compatibility, dynamics simulation and other biophysical methods and/or by other techniques described herein.

To further illustrate these aspects of the invention, FIG. 7 provides a chart that shows certain steps performed in an embodiment of a method of producing a fitter population of character string libraries that utilizes various operators. At least one step, and in certain cases all steps, of the method is/are typically performed in silico, e.g., in a digital system described herein. As shown, step D1 includes applying one or more operators to an initial population of character string libraries to produce an evolved population of character string libraries. Typically, one or more character strings in the initial population of character string libraries correspond to one or more polynucleotides or one or more polypeptides. After assigning a level of fitness (e.g., screening or selecting for, e.g., desired structural properties, desired functional properties, and/or the like) to members of the evolved population of character string libraries (D2), the method includes selecting members of the evolved population of character string libraries with higher levels of fitness than other members of the population to produce a fitter population of character string libraries (D3). The method further includes repeating steps D1-D3 using the fitter population of character string libraries as the initial population of character string libraries in a repeated step D1, e.g., until a desired level of fitness is reached in at least one character string library.
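The iteration of steps D1-D3 can be sketched as a short loop in which each library, rather than each sequence, is the unit of evolution. The sketch below is illustrative only: the toy activity model (fraction of positions matching a hypothetical TARGET string), the single point-mutation operator, the use of mean activity as library fitness, and the retention of parent libraries alongside their mutated copies are all assumptions made for the example, not features taken from the figures.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
TARGET = "MKVLA"  # hypothetical optimum, used only by the toy activity model

def activity(seq):
    """Toy stand-in for a predicted activity: fraction of positions matching TARGET."""
    return sum(a == b for a, b in zip(seq, TARGET)) / len(TARGET)

def library_fitness(library):
    """Library-level fitness; here simply the mean toy activity of its members."""
    return sum(activity(s) for s in library) / len(library)

def mutate_library(library, rng):
    """Operator: copy the library and point-mutate one position of one member."""
    new = list(library)
    i = rng.randrange(len(new))
    pos = rng.randrange(len(new[i]))
    new[i] = new[i][:pos] + rng.choice(ALPHABET) + new[i][pos + 1:]
    return new

def evolve(population, generations, rng, target_fitness=1.0):
    """Repeat D1-D3 until the best library reaches target_fitness or generations run out."""
    for _ in range(generations):
        # D1: apply an operator to every library (parents are kept as well)
        evolved = population + [mutate_library(lib, rng) for lib in population]
        # D2: assign a level of fitness to each library
        scored = sorted(evolved, key=library_fitness, reverse=True)
        # D3: select the fitter libraries as the next initial population
        population = scored[:len(population)]
        if library_fitness(population[0]) >= target_fitness:
            break
    return population

rng = random.Random(0)
initial = [["AAAAA", "GGGGG"], ["MKAAA", "CCCCC"], ["LLLLL", "VVVVV"]]
final = evolve(initial, generations=200, rng=rng)
```

Because the parents are retained in each generation of this sketch, the fitness of the best library never decreases from one round to the next.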

In certain embodiments, step D1 includes (i) providing sets of degenerate substrings based upon the members of the initial population of character string libraries, (ii) recombining the sets of degenerate substrings to produce desired systematically varied character strings, and (iii) estimating one or more activities of the desired systematically varied character strings to produce the evolved population of character string libraries. In some embodiments, one or more members of the initial population of character string libraries are defined by an algorithm that takes one or more parameters, which parameters evolve during step D1. Exemplary parameters include, e.g., character string diversity, modeled evolution method utilized, modeled experimental conditions utilized, PCA modeling, PLS modeling, mutation matrices, relative importance of, e.g., individual character strings or libraries, scoring systems for some or all parameters utilized, and/or the like. The initial population of character string libraries generally includes between about two and about 10⁵ libraries. In addition, each character string library of the initial population of character string libraries typically includes between about two and about 10⁵ members.

Many different operators are optionally used in practicing these methods. These include, e.g., a mutation of one or more members of the character string libraries, a multiplication of one or more members of the character string libraries, a fragmentation of one or more members of the character string libraries, a crossover between members of the character string libraries, a ligation of one or more members of the character string libraries or substrings of the one or more members of the character string libraries, an elitism calculation, a calculation of sequence homology or sequence similarity of aligned character strings, a recursive use of one or more genetic operators for evolution of one or more members of the character string libraries, an application of a randomness operator to one or more members of the character string libraries, a deletion mutation of one or more members of the character string libraries, an insertion mutation into one or more members of the character string libraries, subtraction of one or more members of the character string libraries, selection of one or more members of the character string libraries with desired activities, death of one or more members of the character string libraries, or the like. See, e.g., WO 00/42560; WO 01/75767. The operators are generally included as components of evolutionary search algorithms. Preferred evolutionary search algorithms include genetic algorithms, Monte Carlo algorithms, and/or the like, which are also described further herein.

Levels of fitness are typically assigned to each member of the evolved population of character string libraries using fitness functions. Exemplary fitness functions optionally include, e.g., determining mean activities of members of each character string library, determining standard deviations of activities of members of each character string library, determining levels of character string diversity among members of each character string library, modeling an experimental simplicity of each character string library, determining a level of confidence in measured or predicted values, and/or the like. In preferred embodiments, the activities of the members are determined using multivariate analysis techniques and/or biophysical analysis techniques. For example, multivariate analysis techniques optionally include, e.g., neural network training techniques, principal components analyses, partial least squares analyses, and/or the like. Typical biophysical analysis techniques include one or more of, e.g., structural compatibility analyses, dynamics simulations, hydrophobicity analyses, solubility analyses, immunogenicity analyses, binding assays, enzymatic characterizations, or the like. Multivariate analysis and biophysical analyses are described further herein.
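As a hedged illustration of such a fitness function, the sketch below combines three of the listed components for a single library: mean activity, standard deviation of activities, and character string diversity. The choice of mean pairwise Hamming distance as the diversity measure, the weighted-sum form, the default weights, and the function name score_library are assumptions made for the example. Whether spread in activity should be rewarded or penalized depends on the application, so the weights are left adjustable.

```python
from itertools import combinations
from statistics import mean, pstdev

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def score_library(activities, weights=(1.0, 0.5, 0.1)):
    """Score one library.

    activities: dict mapping each member sequence to its measured or
                predicted activity.
    weights:    hypothetical weights for (mean, standard deviation, diversity).
    """
    seqs = list(activities)
    values = list(activities.values())
    mean_activity = mean(values)
    spread = pstdev(values)  # population standard deviation of the activities
    if len(seqs) > 1:
        diversity = mean(hamming(a, b) for a, b in combinations(seqs, 2))
    else:
        diversity = 0.0
    w_mean, w_spread, w_div = weights
    return {
        "mean": mean_activity,
        "stdev": spread,
        "diversity": diversity,
        "score": w_mean * mean_activity + w_spread * spread + w_div * diversity,
    }

# Hypothetical three-member library with predicted activities.
result = score_library({"ACD": 1.0, "ACE": 3.0, "GCE": 2.0})
```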

Members of the fitter population of character string libraries generally correspond to polynucleotides or polypeptides. Although the steps of these methods are typically performed in silico (e.g., using a digital system, a web-based system, etc.), the methods optionally further include synthesizing, e.g., one or more of the polynucleotides or polypeptides corresponding to one or more members of the fitter population of character string libraries to produce synthesized polynucleotides or polypeptides. In addition, the methods also optionally include, e.g., selecting or screening the synthesized polynucleotides or polypeptides for at least one desired property to produce screened or selected polynucleotides or polypeptides. Typically, the synthesized polynucleotides or polypeptides are screened in vitro or in vivo. Various screening techniques used in practicing these methods are described herein. The methods optionally further include subjecting the screened or selected polynucleotides or polypeptides to one or more artificial evolution procedures. At least one step of the one or more artificial evolution procedures is optionally performed in silico, e.g., using character string representations of the polynucleotides or polypeptides.

In another aspect, the invention relates to a system for producing a fitter population of character string libraries. The system includes (a) at least one computer that includes a database capable of storing at least one population of character string libraries, and (b) system software including one or more logic instructions. The logic instructions are typically for, e.g., (i) applying one or more operators to an initial population of character string libraries to produce an evolved population of character string libraries, (ii) assigning a level of fitness to at least one member of the evolved population of character string libraries, (iii) selecting one or more members of the evolved population of character string libraries with higher levels of fitness than other members of the evolved population of character string libraries to produce the fitter population of character string libraries, and (iv) repeating steps (i)-(iii) using the fitter population of character string libraries as the initial population of character string libraries in a repeated step (i). The system typically further includes a polynucleotide or a polypeptide synthesis device capable of synthesizing polynucleotides or polypeptides that correspond to members of the fitter population of character string libraries. Systems are described in greater detail below.

The invention also provides a computer program product that includes a computer readable medium having one or more logic instructions for (a) applying one or more operators to an initial population of character string libraries to produce an evolved population of character string libraries, and (b) assigning a level of fitness to at least one member of the evolved population of character string libraries. The computer program product also includes logic instructions for (c) selecting one or more members of the evolved population of character string libraries with higher levels of fitness than other members of the evolved population of character string libraries to produce the fitter population of character string libraries, and (d) repeating steps (a)-(c) using the fitter population of character string libraries as the initial population of character string libraries in a repeated step (a).

C. Making Libraries from Heuristically-Derived Models

The following discussion supplements the above described aspect of the invention presented in FIG. 1A. It also presents some alternative embodiments and elaborates on some previously introduced concepts. It does not limit the above discussion.

As described herein, having access to data sets of systematically varied sequences with measured activities enables the generation of various models. This description illustrates how to implement these models in the construction of preferred libraries. Although other modeling techniques, many of which are described herein, are optionally also used to construct/score libraries, PLS models are emphasized in this section for purposes of clarity. In particular, one alternative for deciding on the sequence space to search involves isolating the loads (e.g., relationships to function) for each amino acid residue in a given alignment. For example, loads are typically found stored as a matrix in the model generated by, e.g., any standard PLS modeling tool and can be retrieved, e.g., from a File_Name.loads matrix.

In overview, the importance of each residue and of the best, for example, 5% of residue pairs (defined as cross products in the matrix) is optionally determined using PLS or the like, and the relative importance is given as load (if one component is used), regression coefficient, VIP (variable importance for projection), etc. Optionally, loads are subsequently sorted, e.g., according to numerical value. The preferred amino acid in each position in the particular protein having two or more optional amino acids will be determined by the corresponding amino acid having the highest load, regression coefficient, VIP, etc. A “hero” clone having the theoretically best sequence (i.e., one that encodes the amino acid option having the highest load in each position) is thus determined. Further, for models generating more than one latent variable, regression coefficients or similar parameters can also be used.
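To make the “hero” clone determination concrete, the following minimal sketch assumes that the loads have already been extracted from a model and arranged as a mapping from alignment position to the load of each optional amino acid at that position. The positions, residues, load values, and function names are hypothetical; in practice the values would come from whatever modeling tool was used.

```python
# Hypothetical loads: alignment position -> {amino acid option: load}
loads = {
    12: {"A": 0.40, "V": -0.10},
    45: {"S": 0.05, "T": 0.30, "G": -0.20},
    78: {"L": -0.15, "F": 0.22},
}

def sort_loads(loads):
    """Flatten the loads and sort them by numerical value, highest first."""
    flat = [(pos, aa, value) for pos, options in loads.items()
            for aa, value in options.items()]
    return sorted(flat, key=lambda entry: entry[2], reverse=True)

def hero_clone(loads):
    """Pick, at each variable position, the amino acid with the highest load."""
    return {pos: max(options, key=options.get)
            for pos, options in sorted(loads.items())}

sorted_loads = sort_loads(loads)
hero = hero_clone(loads)
```

Here the hero clone carries A at position 12, T at position 45, and F at position 78. The same selection could be driven by regression coefficients or VIP values in place of loads.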

As explained, these approaches may initially include identifying the wet-lab validated “best” clone in a particular data set, which is typically the clone with the highest measured function that still models well (i.e., falls relatively close to the predicted value in PLS cross validation). Each residue in the best clone is typically compared with those from the loads matrix, e.g., starting with the residue having the highest load. If the residue with the highest load is not present in the “best” clone, that position is introduced as a toggle in the subsequent library. In some embodiments, the residues to toggle are determined by sorting each residue by increasing VIP and omitting those that are well characterized in the model (i.e., exist in the data set as many instances and are systematically varied). This can most easily be done by retaining only those that occur as single (and double if the data set is large enough) instances. A library of two would thus encode the “hero” clone and a toggle of the residue having VIP closest to zero and only present in a single instance in the data set. A library of 4 (2²) would toggle the two lowest VIP residues with single instances, etc. These processes are repeated until the library reaches a selected or sufficient size. Each added diversity represented by a toggle doubles the size of the library, such that 10 positions equal approximately 1,000 clones (1,024), 13 positions equal approximately 10,000 clones (8,192), 20 positions equal approximately 1,000,000 clones (1,048,576), etc. The appropriate library size depends on factors such as cost of screen, ruggedness of landscape, preferred percentage sampling of space, and the like. Optionally, residues having small loads are toggled, e.g., to search the local space surrounding an already validated “best” clone. An additional option includes starting with an average clone that models well and toggling the high loads, e.g., to explore larger space in search for activity hills previously omitted from the sampling. This type of library is generally more relevant at the early rounds, because it generates a more refined picture for subsequent rounds. As an additional filter, one can omit residues that are originally derived from non-natural diversity. The rationale is that naturally existing diversity has a higher probability of encoding functionality than does randomly occurring diversity, which may or may not be true.
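The doubling of library size with each toggle can be checked by enumerating a small toggle library directly. In this sketch, which is illustrative only, a toggle is represented as a 0-based position paired with one alternative residue, so that each toggled position offers exactly two options. The base sequence, positions, and function name are hypothetical.

```python
from itertools import product

def toggle_library(base, toggles):
    """Enumerate every combination of toggled residues.

    base:    the starting sequence (e.g., a validated "best" clone).
    toggles: dict mapping a 0-based position to its alternative residue.
    Returns 2**len(toggles) sequences.
    """
    positions = sorted(toggles)
    # Each toggled position offers the base residue or the alternative.
    choices = [(base[p], toggles[p]) for p in positions]
    library = []
    for combo in product(*choices):
        seq = list(base)
        for p, aa in zip(positions, combo):
            seq[p] = aa
        library.append("".join(seq))
    return library

# Two toggles give 2**2 = 4 clones.
lib = toggle_library("MKVLAG", {1: "R", 4: "S"})

# Library size as a function of the number of toggled positions.
sizes = {n: 2 ** n for n in (10, 13, 20)}
```

The sizes computed at the end match the figures given above: 1,024 clones for 10 positions, 8,192 for 13, and 1,048,576 for 20.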

To further illustrate, FIG. 8 is a chart that shows certain steps performed in an embodiment of a method of selecting amino acid positions in a polypeptide variant to artificially evolve, which steps are typically performed in a digital or web-based system. As shown, the methods include providing a population of polypeptide variants (E1) and scoring (e.g., in silico) members of the population of polypeptide variants (e.g., character string variants, etc.) for one or more desired properties (e.g., structural and/or functional properties) to produce a polypeptide variant data set (E2). The population of polypeptide variants is generally provided by one or more artificial evolution procedures. In addition, at least one step (and often more) of the artificial evolution procedures is typically performed in silico. Populations of polypeptide variants typically include, e.g., between about two and about 10⁶ members. In preferred embodiments, members of the population of polypeptide variants are systematically varied sequences.

The methods further include correlating amino acids in amino acid positions in the polypeptide variants with the one or more desired properties using the polypeptide variant data set to produce a loads matrix (e.g., a qualitative matrix (e.g., including amino acid identities, etc.), a quantitative matrix (e.g., including physicochemical properties, such as hydrophobicity measures, etc.), a categorical matrix (e.g., whether amino acids are charged, bulky, etc.), and/or the like), e.g., representing amino acid contributions to the desired properties (E3). For example, if two polypeptide sequences are identical except for a single amino acid residue, and the sequences have different activities, then the entire difference in function is typically assumed to correlate only with that amino acid difference. Accordingly, essentially any way that the relative importance for a given variable towards a functional parameter Y can be scored is optionally used in these methods. To illustrate, the matrix is optionally based on regression-based algorithms, e.g., PLS, regression coefficients, VIP (Variable Importance for Projection) (one preferred algorithm), MLR (multiple linear regression), ILS (inverse least square), PCR (principal component regression), and/or the like. Additional alternatives include basing the loads matrix on pattern-based algorithms, such as neural networks, CART (classification and regression trees), MARS (multivariate adaptive regression splines), and/or the like. The methods also typically include sorting entries in the loads matrix, e.g., according to numerical value, etc.

As shown in step E4, the methods also include identifying one or more amino acid differences between at least one member selected from the population of polypeptide variants and corresponding entries in the loads matrix, thereby selecting the amino acid positions in the polypeptide variant to artificially evolve (e.g., toggle with variable amino acid residues). For example, the preferred solution is to pick a member that is “best” or highest scoring in the preferred function or set of functions (e.g., as long as it fits the model reasonably well) and pick residues to evolve on that member. Typically, between about two and about 100 amino acid positions in the polypeptide variant are selected to artificially evolve. Optionally, all amino acid positions in a given variant are selected. In certain embodiments, the at least one member selected from the population of polypeptide variants in E4 includes a highest scoring member from E2. The methods typically further include artificially evolving one or more of the amino acid positions selected in E4 to produce an evolved polypeptide library. In addition, the methods optionally also include repeating E1-E4 using the evolved polypeptide library as the population of polypeptide variants in a repeated E1. Evolved polypeptide libraries optionally include physical or computational libraries. Physical libraries typically include, e.g., between about two and about 10⁶ members. In contrast, computational libraries typically include, e.g., between about two and about 10²⁰ members.

As referred to above, in preferred embodiments, loads matrices are generated from polypeptide variant data sets using various heuristically-derived modeling techniques, including regression-based algorithms, pattern-based algorithms, and/or the like. Exemplary regression-based algorithms include, e.g., partial least squares regression, multiple linear regression, inverse least squares regression, principal component regression, variable importance for projection, etc. Exemplary pattern-based algorithms include, e.g., neural networks, classification and regression trees, multivariate adaptive regression splines, and/or the like. In certain preferred embodiments, E3 includes generating a partial least squares model from the polypeptide variant data set to produce the loads matrix. The partial least squares model typically generates more than one latent variable. The methods also typically further include using regression coefficients.

In preferred embodiments, step E4 includes comparing one or more amino acid positions in the at least one member with one or more corresponding amino acid positions from the loads matrix to identify at least one amino acid in the loads matrix that is absent in the member to select the amino acid positions in the polypeptide variant to artificially evolve. Generally, each amino acid position in the at least one member is compared with each corresponding amino acid position from the loads matrix. Selected amino acid positions are optionally artificially evolved by substituting one or more corresponding amino acids from the loads matrix. In addition, the member selected from the population of polypeptide variants typically includes a higher scoring member (e.g., the highest scoring member) of the polypeptide variant data set than other members of the polypeptide variant data set. For example, the higher scoring member is typically proximal to a predicted score in a partial least squares cross validation. The amino acid positions from the loads matrix that include higher loads are typically compared prior to the amino acid positions from the loads matrix that include lower loads. Optionally, the amino acid positions from the loads matrix that include lower loads are compared prior to the amino acid positions from the loads matrix that include higher loads. In some embodiments, the member selected from the population of polypeptide variants includes a substantially average scoring member of the polypeptide variant data set. In these embodiments, the amino acid positions from the loads matrix that include higher loads are typically compared prior to the amino acid positions from the loads matrix that include lower loads.
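The comparison performed in step E4 can be sketched as follows. The example assumes a loads matrix stored as a mapping from position to the load of each amino acid option, and a selected member stored as a mapping from position to its residue. Positions are visited in order of decreasing top load, and a position is selected whenever the highest-load residue is absent from the member. The data, the max_positions cutoff, and the function name are hypothetical.

```python
def positions_to_toggle(member, loads, max_positions):
    """Select positions where the member lacks the highest-load residue.

    member: dict mapping position -> residue present in the selected member.
    loads:  dict mapping position -> {residue: load}.
    Returns a dict mapping position -> (member residue, highest-load residue).
    """
    # Highest-load residue and its load at each position
    top = {pos: max(options.items(), key=lambda kv: kv[1])
           for pos, options in loads.items()}
    # Compare positions with higher loads before positions with lower loads
    ordered = sorted(top, key=lambda pos: top[pos][1], reverse=True)
    selected = {}
    for pos in ordered:
        if len(selected) >= max_positions:
            break
        best_residue, _ = top[pos]
        if member[pos] != best_residue:
            selected[pos] = (member[pos], best_residue)
    return selected

# Hypothetical loads matrix and highest-scoring member.
loads = {
    12: {"A": 0.40, "V": -0.10},
    45: {"S": 0.05, "T": 0.30, "G": -0.20},
    78: {"L": -0.15, "F": 0.22},
}
best_member = {12: "V", 45: "T", 78: "L"}
toggles = positions_to_toggle(best_member, loads, max_positions=2)
```

Position 45 is skipped because the member already carries the highest-load residue there. Reversing the sort order would give the optional embodiment in which lower loads are compared first.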

FIG. 9 is a chart that shows certain steps performed in another embodiment of these methods of selecting amino acid positions in a polypeptide variant to artificially evolve. As shown, the method includes providing a population of polypeptide variants (F1), and scoring members of the population of polypeptide variants for one or more desired properties to produce a polypeptide variant data set (F2). In step F3, a partial least squares model is generated from the polypeptide variant data set, which partial least squares model correlates amino acid positions in the polypeptide variants with the one or more desired properties to produce a loads matrix. The methods also include identifying one or more amino acid differences between at least one member selected from the population of polypeptide variants and the loads matrix from the partial least squares model, thereby selecting amino acid positions in the polypeptide variant to artificially evolve (F4).

The invention also provides a system for selecting amino acid positions in a polypeptide character string variant to artificially evolve. The system includes (a) a computer that includes a database capable of storing at least one population of polypeptide character string variants, and (b) system software. The system software includes one or more logic instructions for (i) providing one or more populations of polypeptide character string variants, and (ii) scoring members of the one or more populations of polypeptide character string variants for one or more desired properties to produce a polypeptide character string variant data set. The software also includes instructions for (iii) correlating amino acids in amino acid positions in the polypeptide character string variants with the one or more desired properties using the polypeptide character string variant data set to produce a loads matrix representing amino acid contributions to the one or more desired properties, and (iv) identifying one or more amino acid differences between at least one member selected from the one or more populations of polypeptide character string variants and corresponding entries in the loads matrix. Additional details relating to various aspects of the systems of the present invention are provided below.

In addition, the invention relates to a computer program product for selecting amino acid positions in a polypeptide character string variant to artificially evolve. The computer program product includes a computer readable medium having one or more logic instructions for (a) providing one or more populations of polypeptide character string variants, and (b) scoring members of the one or more populations of polypeptide character string variants for one or more desired properties to produce a polypeptide character string variant data set. The program also includes instructions for (c) correlating amino acids in the amino acid positions in the polypeptide character string variants with the one or more desired properties using the polypeptide character string variant data set to produce a loads matrix representing amino acid contributions to the one or more desired properties, and (d) identifying one or more amino acid differences between at least one member selected from the one or more populations of polypeptide character string variants and corresponding entries in the loads matrix.

D. Using Cross Products in Heuristically-Derived Models for Sequence Space Exploration

Interactions (e.g., second order, third order, etc.) among amino acid residues are important for protein sequence-activity (function) relationships (PSAR (PSFR)). Another aspect of the invention involves calculating cross product terms, i.e., co-varying residues, among various columns corresponding to amino acid residue positions in a matrix. A detailed description of covariation phenomena is provided in the Examples below. The cross product terms are then typically added to the linear terms, which correspond to amino acid residues, and an expanded X predictor matrix is generated. Heuristically-derived models are generated with the expanded predictor matrix to identify important cross terms along with linear terms. This cross product and linear term information is then typically utilized in the construction of subsequent libraries. For example, two amino acid residues alone may not be important, e.g., as manifested by weights of linear terms in PLS or PCR modeling, but their cross product term may be important. Accordingly, the corresponding amino acid positions may be good candidates for exploration in subsequent rounds of artificial evolution to ensure optimal sequence space searching.

To further illustrate, FIG. 10 is a chart that shows certain steps performed in an embodiment of a method of identifying amino acids in polypeptides that are important for a polypeptide sequence-activity relationship. As shown in G1, the methods include providing an X predictor matrix that includes a data set corresponding to a set of polypeptide sequence variants in which at least a subset of the set of polypeptide sequence variants include one or more measured activities. The set of polypeptide sequence variants typically includes, e.g., a set of systematically varied polypeptide sequences or the like, e.g., produced by one or more diversity generating or artificial evolution procedures, such as any of those described herein. As further shown in G2, the methods also include calculating one or more cross product terms between or among columns of the X predictor matrix. Each column entry corresponds to an amino acid of a polypeptide sequence variant from the set of polypeptide sequence variants. In addition, the methods also include adding at least one of the one or more cross product terms calculated in step G2 to one or more linear terms of the X predictor matrix to produce an expanded X predictor matrix (G3). Cross product terms identify covarying amino acids in the polypeptides, whereas the linear terms correspond to amino acids in the polypeptide sequence variants. Thereafter, the methods include generating a model with the expanded X predictor matrix to identify important cross product terms and/or linear terms, thereby identifying the amino acids in the polypeptides that are important for the polypeptide sequence-activity relationship (G4).
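Steps G2 and G3 can be illustrated with a small sketch that builds an expanded X predictor matrix. The example assumes a binary encoding in which each column indicates whether a variant carries a particular substitution, and it computes only pairwise (second order) cross products as element-wise products of two columns. The column labels, the encoding, and the function name are hypothetical; fitting a model to the expanded matrix (step G4) is not shown.

```python
from itertools import combinations

def expand_with_cross_products(X, names):
    """Append every pairwise cross product column to the linear columns.

    X:     list of rows, one per sequence variant; each entry is 1 if the
           variant carries the substitution for that column, else 0.
    names: labels for the linear columns.
    Returns (expanded matrix, expanded column labels).
    """
    pairs = list(combinations(range(len(names)), 2))
    expanded_names = list(names) + [f"{names[i]}*{names[j]}" for i, j in pairs]
    expanded = [list(row) + [row[i] * row[j] for i, j in pairs] for row in X]
    return expanded, expanded_names

# Hypothetical data set: four variants, three variable positions.
names = ["A12V", "S45T", "L78F"]
X = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
    [1, 1, 1],
]
X_expanded, expanded_names = expand_with_cross_products(X, names)
```

With three linear terms, three pairwise cross product columns are added, for six columns in total. Because the number of pairs grows quadratically with the number of positions, in practice only a subset of cross product terms would typically be carried forward.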

Optionally, the heuristically-derived models are produced using one or more regression-based algorithms selected from, e.g., a partial least squares regression, a multiple linear regression, an inverse least squares regression, a principal component regression, a variable importance for projection, or the like. As an additional option, the model is produced using one or more pattern-based algorithms selected from, e.g., a neural network, a classification and regression tree, a multivariate adaptive regression spline, or the like.

Typically, the important cross product terms and/or linear terms identified in G4 are used to design one or more polypeptide libraries. As mentioned, in certain aspects, two or more linear terms individually may include unimportant terms for the polypeptide sequence-activity relationship. However, cross product terms calculated from the same two or more linear terms may be identified as important for the polypeptide sequence-activity relationship. Cross product terms typically correspond to interactions between or among amino acids in the polypeptide sequence variants. For example, the interactions include, e.g., secondary or tertiary interactions, direct interactions, indirect interactions, physicochemical interactions, interactions due to folding intermediates, translational effects, and/or the like. Sequence-activity information derived from covariation analysis (i.e., cross product terms) can be used in a method for characterizing the covariation in a polypeptide library by:

(a) identifying varying amino acid residues in a character string population that represents a population of homologous parental polypeptides;

(b) identifying amino acid residues in the character string population that covary with one another to produce a parental covariation data set;

(c) providing a set of overlapping synthetic oligonucleotides comprising members that encode one or more covarying amino acid residues identified in the character string population,

wherein each synthetic oligonucleotide encodes at most one member of a set of amino acid residues that covary with each other;

(d) recombining the overlapping synthetic oligonucleotides to produce a set of recombined polynucleotides that encode progeny of the homologous parental polypeptides;

(e) expressing at least a subset of the set of recombined polynucleotides to produce a set of progeny polypeptides;

(f) selecting or screening at least a subset of the progeny polypeptides for a desired property;

(g) sequencing one or more progeny polypeptides, or one or more recombined polynucleotides that encode the one or more progeny polypeptides, that comprise the desired property to produce a progeny sequence data set;

(h) identifying one or more pairs of amino acid residues in the progeny sequence data set that covary with one another to produce a progeny covariation data set; and

(i) identifying differences between the parental and progeny covariation data sets, thereby characterizing the covariation in the population of homologous polypeptides.
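The computational parts of this method, steps (a), (b), (h), and (i), can be illustrated with a short Python sketch. The text does not specify how covariation is measured, so mutual information between alignment columns is used here purely as an assumption; the sequences, the 0.5 bit threshold, and all function names are hypothetical, and positions are zero-based.

```python
from collections import Counter
from itertools import combinations
from math import log2


def variable_columns(alignment):
    """Step (a): alignment columns in which more than one residue occurs."""
    length = len(alignment[0])
    return [c for c in range(length) if len({s[c] for s in alignment}) > 1]


def mutual_information(alignment, i, j):
    """Mutual information (in bits) between columns i and j."""
    n = len(alignment)
    count_i = Counter(s[i] for s in alignment)
    count_j = Counter(s[j] for s in alignment)
    count_ij = Counter((s[i], s[j]) for s in alignment)
    return sum(
        (c / n) * log2((c / n) / ((count_i[a] / n) * (count_j[b] / n)))
        for (a, b), c in count_ij.items()
    )


def covariation_set(alignment, threshold=0.5):
    """Steps (b)/(h): pairs of variable columns that covary."""
    cols = variable_columns(alignment)
    return {
        (i, j)
        for i, j in combinations(cols, 2)
        if mutual_information(alignment, i, j) >= threshold
    }


# Hypothetical, pre-aligned toy sequences of equal length.
parental = ["AKLD", "AKLE", "GRLD", "GRLE"]
progeny = ["AKLD", "ARLE", "GKLD", "GRLE"]

parental_pairs = covariation_set(parental)
progeny_pairs = covariation_set(progeny)

# Step (i): differences between the two covariation data sets.
lost = parental_pairs - progeny_pairs
gained = progeny_pairs - parental_pairs
print("lost:", sorted(lost), "gained:", sorted(gained))
```

Here the parental coupling between positions 0 and 1 is broken in the selected progeny, while a new coupling between positions 1 and 3 appears. With real data, a correction for small sample sizes would likely be needed before reading much into such differences.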

These aspects of the invention are also embodied in a system for identifying amino acids in polypeptides that are important for a polypeptide sequence-activity relationship. The system includes (a) a computer that includes a database capable of storing at least one population of character string libraries, and (b) system software. The system software includes one or more logic instructions for (i) providing an X predictor matrix that includes a data set corresponding to a set of polypeptide sequence variants in which at least a subset of the set of polypeptide sequence variants include one or more measured activities, and (ii) calculating one or more cross product terms between or among columns of the X predictor matrix in which each column entry corresponds to an amino acid of a polypeptide sequence variant from the set of polypeptide sequence variants. The software also includes instructions for (iii) adding at least one of the one or more cross product terms calculated in step (ii) to one or more linear terms of the X predictor matrix to produce an expanded X predictor matrix, and (iv) generating a model with the expanded X predictor matrix to identify important cross product terms and/or linear terms. Additional details regarding the systems of the invention are described below.

The invention also provides a computer program product for identifying amino acids in polypeptides that are important for a polypeptide sequence-activity relationship. The computer program product includes a computer-readable medium having one or more logic instructions for (a) providing an X predictor matrix that includes a data set corresponding to a set of polypeptide sequence variants in which at least a subset of the set of polypeptide sequence variants include one or more measured activities, and (b) calculating one or more cross product terms between or among columns of the X predictor matrix in which each column entry corresponds to an amino acid of a polypeptide sequence variant from the set of polypeptide sequence variants. The program also includes instructions for (c) adding at least one of the one or more cross product terms calculated in (b) to one or more linear terms of the X predictor matrix to produce an expanded X predictor matrix, and (d) generating a model with the expanded X predictor matrix to identify important cross product terms and/or linear terms.

E. Protein Variant Library Design Incorporating Evolutionary Information

While it may be desirable to vary amino acid residues in a large number of positions in a single protein variant library, doing so may lead to a library with a large number of variants having little or no activity due to deleterious combinations of too many variable residues. The present invention provides an efficient way of optimizing a protein variant for a desired activity by making one or more protein variant libraries that incorporate only certain variable amino acid residue substitutions from a set of parental polypeptides. The set of variable amino acid residues is selected for incorporation into a protein variant library based on the evolutionary context of the variable amino acid residue. Those substitutions that represent evolutionarily conservative substitutions are incorporated into protein variants of the library.

Amino acid changes allowed by evolution generally conserve fold and function of proteins. On relatively short evolutionary timescales, allowed changes tend to be context independent, that is, make an “additive” fitness contribution (and work well with other changes). Essentially infinite sources of homologues on any desired divergence timescale can be accessed by “allowed” amino acid changes for that timescale. There is also evidence that subtle perturbations in protein structure can have a huge impact on function (Kidokoro (1998) “Design of protein function by physical perturbation method,” Adv. Biophys. 35:121-143, and Shimotohno et al. (2001) “Demonstration of the importance and usefulness of manipulating non-active-site residues in protein design,” J. Biochem. (Tokyo) 129:943-948).

The present invention provides methods for searching sequence space by making evolutionarily conservative substitutions to generate diversity with high fitness levels. According to the methods, for example, parental sequences are aligned to determine which residues vary between parental sequences (i.e., are flexible), then an evolutionary substitution matrix is applied to identify a subset of the variable residues that represent conservative substitutions. A protein variant library is then generated that incorporates the conservative subset of variable amino acid residues into the sequences of the protein variants. Alternatively, other substitution matrices can be used to identify the subset of variable residues to incorporate into a protein variant library. Other suitable substitution matrices include those based on physicochemical properties or other parameters described herein. Optionally, the methods can be applied to single sequences by applying a user-defined filter or constraint, such as that cysteine, proline, and glycine residues remain unchanged (i.e., are less tolerant to change), and then applying a substitution matrix to the other residues.
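The filtering step described above can be sketched as follows. This is a minimal illustration, not the method of the invention: it assumes the parental sequences are already aligned and of equal length, it measures variation against one parent chosen as a reference, and it replaces an evolutionary matrix such as PAM or BLOSUM with a hypothetical binary matrix in which a substitution counts as conservative only when both residues fall in the same residue class. The class assignments and sequences are illustrative assumptions.

```python
# Hypothetical residue classes defining a binary substitution matrix:
# a substitution is "conservative" only within a class.
CLASSES = {
    "aliphatic": "AVLIM",
    "aromatic": "FWY",
    "polar": "STNQ",
    "basic": "KRH",
    "acidic": "DE",
}
# User-defined constraint from the text: these residues are left unchanged.
FIXED = set("CPG")


def residue_class(aa):
    """Return the class name for a residue, or None if it is unclassified."""
    for name, members in CLASSES.items():
        if aa in members:
            return name
    return None


def is_conservative(a, b):
    """True when a -> b is a change within a single residue class."""
    class_a = residue_class(a)
    return a != b and class_a is not None and class_a == residue_class(b)


def conservative_diversity(parents, reference_index=0):
    """Map each variable position to its conservative alternatives."""
    reference = parents[reference_index]
    diversity = {}
    for pos, ref_aa in enumerate(reference):
        if ref_aa in FIXED:
            continue
        alternatives = sorted(
            {p[pos] for p in parents if is_conservative(ref_aa, p[pos])}
        )
        if alternatives:
            diversity[pos] = alternatives
    return diversity


# Hypothetical pre-aligned parental sequences (zero-based positions).
parents = ["MKVLDS", "MRILES", "MKVFDT"]
print(conservative_diversity(parents))
```

Position 3 varies between L and F across the parents, but the change crosses class boundaries in this toy matrix and is therefore excluded from the library design.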

Typically, a substitution matrix, such as Dayhoff's PAM matrices (for various PAM distances), site dependent matrices, BLOSUM matrices, JTT matrices, simple binary matrices that capture any amino acid classification, and the like can be used to create different timescales (see, e.g., Dayhoff and Eck (1968) “A model of evolutionary change in proteins,” Atlas of Protein Sequence and Structure 3:33-41, and Henikoff and Henikoff (1992) “Amino acid substitution matrices from protein blocks,” Proc. Nat'l. Acad. Sci. USA 89:10915-10919). Tuning the probability of transition from one amino acid to another can change the level of conservation. Both the probability cutoff and the matrix itself are parameters in the model. There are several other matrices that are also available. These matrices can be structure dependent, that is, the inside core of a protein has patterns of substitution that may differ from the external surface of the protein, helices can have different patterns from strands, and the like (Koshi and Goldstein (1997) “Mutation matrices and physical-chemical properties: correlations and implications,” Proteins 27:336-344, and Koshi and Goldstein (1996) “Correlating structure-dependent mutation matrices with physical-chemical properties,” Pac. Symp. Biocomput. 488-499). A physicochemical property-based matrix can also be used to select suitable substitutions. Additional details regarding substitution matrices suitable for use in the present invention are discussed further in, e.g., Durbin et al., Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press (1998). In using any of the above matrices, a library of variant polypeptides that incorporates conservative diversity and/or non-conservative diversity can be made. For non-conservative libraries, substitutions that are less likely to happen under divergent evolution are typically selected.

When structures of the proteins of interest are available, regions/residues can be identified that will have the desired impact on protein function. This can be achieved by, e.g., simple modeling of changes in electrostatics around active sites or changes that lead to modified dynamics in the protein (Kidokoro, supra). Structural information can also be used to identify domains/modules that will have the most impact, so that efforts can be limited to that selected region of the proteins.

Algorithms of the present invention can be used to construct a series of libraries, for any given gene, with a continuum of median fitness, a continuum of genetic and phenotypic variance, and a high level of additive genetic variability. The algorithms are essentially “automatic” in the sense that they are implemented relatively independently of expert knowledge of the protein.

As an overview of these methods, FIG. 11 provides a chart that depicts certain steps performed in one method embodiment for efficiently searching sequence space. As shown, the method includes identifying an initial gene or gene family (i.e., gene of interest) (H1), obtaining sequences of homologues spanning a desired evolutionary timescale (H2), and evaluating the number and type of amino acid changes (e.g., with respect to the polypeptide encoded by the initial gene) that are identified as a function of time/probability (P) (i.e., indicated by timescale or probability of such mutation to occur in nature; level of conservation) (H3). The method also includes evaluating potential library diversity as a function of time/probability (H4), and identifying the number of variable positions at the given timescale that results in the desired library size (e.g., based upon the screening throughput and expected fitness of the new library) (H5). Further, the method includes estimating median fitness and variance of libraries as a function of the timescale from which the diversity comes (H6), and making a series of libraries covering the desired median fitness and variance range (H7).
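Steps H4 and H5 amount to simple combinatorics, which the following Python sketch illustrates under stated assumptions. It assumes a fully combinatorial library in which each variable position carries the parental residue plus its allowed alternatives, and that positions are ordered from most to least conservative; the counts and the screening limit are hypothetical.

```python
def library_size(alternatives_per_position):
    """H4: number of distinct variants in a fully combinatorial library.

    Each position contributes the parental residue plus its alternatives.
    """
    size = 1
    for n_alternatives in alternatives_per_position:
        size *= 1 + n_alternatives
    return size


def positions_within_budget(alternatives_per_position, max_library_size):
    """H5: largest leading run of positions whose library fits the budget."""
    chosen = 0
    for k in range(1, len(alternatives_per_position) + 1):
        if library_size(alternatives_per_position[:k]) > max_library_size:
            break
        chosen = k
    return chosen


# Hypothetical counts of allowed alternatives at eight variable positions,
# ordered from most to least conservative.
alternatives = [1, 1, 2, 1, 3, 1, 2, 1]
screening_limit = 1000  # hypothetical number of variants that can be screened

k = positions_within_budget(alternatives, screening_limit)
print(k, library_size(alternatives[:k]))
```

Varying all eight positions would give 1152 variants, which exceeds the assumed screening limit, so the sketch stops at seven positions and 576 variants.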

All of these methods can be implemented for an entire alignment and/or for a specific user defined set of residues or using structural information to make libraries of domains (modules, sub-domains, etc.). For diversity generation, these matrix-based approaches can be used in conjunction with other methods like PCA, PLS, PCR, MLR, or the like, where load information (e.g., site entropies) on specific sites of the protein can attach significance to substitution possibilities. Information from consensus sequences can be used to restrict or increase diversity in the library. Ancestral sequence reconstruction methods can reliably identify changes that took place in the set of proteins very early on in the evolutionary process, and changes that are adaptive in nature. This can be automatically used in the approaches described herein to make desired libraries.

These methods typically include various selection stringencies and library sizes. For example, assessments of the “fragility” of a protein are optionally made by estimations. Such estimations are typically governed by model studies of protein folding (e.g., already in the literature, etc.), empirical data (e.g., screen about 100-1000 hits per library, etc.), extrapolations from the rate of changes in evolution, size of library that can be screened, and/or the like. Libraries typically include between about 10³ and about 10¹² members, depending upon the particular screening methods utilized. For example, one should consider the correlation of the screen with downstream higher complexity screens.

These methods for high efficiency sequence space searches provide many different advantages. In particular, the general approach becomes more powerful and refined as data on proteins/folds of interest accumulates. Also, desired sequence space can be automatically defined from phylogenetic data using a computer. In addition, phylogenetic information about “safe” steps (e.g., conservative residue substitutions) can be harnessed for subsequent analysis and development.

In certain aspects, the present invention provides a system for producing libraries of desired sizes. The system includes (a) at least one computer that includes a database capable of storing sets of biopolymer character strings, and (b) system software. The system software includes one or more logic instructions for: (i) identifying one or more homologues of at least one initial polypeptide sequence; (ii) comparing the sequences of the homologue(s) and the initial polypeptide; (iii) identifying variable amino acid residues, wherein variable amino acid residues differ with respect to amino acid residue type at corresponding positions in the sequences of the homologue(s) and the initial polypeptide sequence; (iv) identifying a set of evolutionarily conserved variable amino acid residues; and (v) generating a library of protein variants incorporating the set of evolutionarily conserved variable amino acid residues. The system software also includes instructions for (vi) identifying variable monomer positions in the at least one initial biopolymer character string from the selected evolutionary timescale that result in a desired library size, and (vii) providing a series of libraries that comprise a selected median fitness and variance range.

The invention also includes a computer program product for producing libraries of desired sizes. The computer program product includes a computer readable medium having one or more logic instructions for: (a) identifying one or more homologues of at least one initial biopolymer character string from a selected evolutionary timescale, (b) plotting a number of monomer changes for the at least one initial biopolymer character string against a time/probability, and (c) plotting potential library size against the time/probability. The computer program product also includes instructions for (d) identifying variable monomer positions in the at least one initial biopolymer character string from the selected evolutionary timescale that result in a desired library size, and (e) providing a series of libraries that comprise a selected median fitness and variance range.

IV. SEQUENCE ACTIVITY PREDICTIONS

A. Use of Neural Networks to Identify DNA or Protein Sequences with Improved Characteristics

In the present invention, neural networks are used to analyze data derived from various artificial evolution processes, including DNA shuffling, to predict sequences that have improved characteristics. In one example, such neural networks may be used in genetic algorithms to optimize sequences for further protein variant libraries. In brief, the methods include using data from each round of, e.g., a shuffling procedure as a training set for a neural network. Once a neural network has been trained, character string sequences can be “assayed” in silico using the trained network. Sequences that the network identifies as having improved characteristics are then typically added to subsequent rounds of shuffling, or synthesized de novo. Scoring systems used to rate these newly predicted character string sequences optionally take into account not only the neural network predicted score, but also a score of how many derivative character string sequences (e.g., character string variants of the newly predicted character string sequences) also have a high neural network score. For example, if character string sequence A was mutated into 1000 character string variants, and each variant was scored according to the network, the percentage of character string variants that score above a certain cutoff in the neural network is optionally counted. Further, this data may be combined with the neural network score of character string sequence A to produce a final score. Such a score would represent not only what the network predicted for that sequence, but also how probable that sequence is to mutate into as good or better sequences.
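The combined scoring idea can be sketched in Python as follows. Several things here are assumptions made for illustration only: `toy_predict` is a hypothetical stand-in for a trained neural network, the variants are single random point mutants, and the final score is an equally weighted average of the two components, since the text does not say how they are combined.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def point_mutants(sequence, n_variants, rng):
    """Generate variants that each differ from `sequence` at one position."""
    variants = []
    for _ in range(n_variants):
        pos = rng.randrange(len(sequence))
        new_aa = rng.choice([aa for aa in ALPHABET if aa != sequence[pos]])
        variants.append(sequence[:pos] + new_aa + sequence[pos + 1:])
    return variants


def final_score(sequence, predict, cutoff, n_variants=1000, weight=0.5, seed=0):
    """Combine a sequence's own score with the fraction of good variants.

    Returns (final score, own score, fraction of variants above cutoff).
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    own = predict(sequence)
    variants = point_mutants(sequence, n_variants, rng)
    fraction = sum(predict(v) > cutoff for v in variants) / n_variants
    return weight * own + (1 - weight) * fraction, own, fraction


def toy_predict(sequence):
    """Hypothetical stand-in for a trained network: hydrophobic fraction."""
    return sum(aa in "AVLIMFW" for aa in sequence) / len(sequence)


score, own, fraction = final_score("AVLIKDE", toy_predict, cutoff=0.5)
print(round(own, 3), round(fraction, 3), round(score, 3))
```

Two sequences with the same predicted score can thus be separated by how many of their near neighbors also score well, which is the robustness notion the paragraph describes.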

To further illustrate, FIG. 13 provides a chart that shows certain steps performed in an embodiment of a method of predicting character strings that include desired properties. As shown, the methods include evolving at least one parental character string (e.g., a plurality of parental character strings, etc.) using at least one artificial evolution procedure to produce at least one population of artificially evolved character strings (I1). Artificial evolution procedures carried out on character strings are typically performed reiteratively to produce multiple populations of artificially evolved character strings, which multiple populations of artificially evolved character strings are used to train the neural network. The methods also include selecting or screening the population of artificially evolved character strings for at least one desired property (e.g., a physical property, a catalytic property, or the like that is improved relative to the parental character string) to produce a population of selected artificially evolved character strings (I2). The methods also include training a neural network with the population of selected artificially evolved character strings to produce a trained neural network (I3). Thereafter, the methods include predicting character strings that include, or are likely to include, the desired property using the trained neural network (I4). Additional details relating to neural networks are provided above.

In certain embodiments, the methods further include repeating steps I1 and I2 using the population of selected artificially evolved character strings in step I2 as the at least one parental character string in a repeated step I1. In these embodiments, the methods optionally further include using the population of selected artificially evolved character strings from at least one repeated step I2 to further train the neural network in step I3. Parental character strings typically correspond to polynucleotides or polypeptides. In some embodiments, the methods optionally further include synthesizing polynucleotides or polypeptides that correspond to the character strings predicted in step I4. In other embodiments, the methods further include repeating steps I1-I4 using at least one of the character strings predicted in step I4 as a parental character string in a repeated step I1. Typically, the methods further include using the trained neural network as a filter to bias library production toward active library members.

In particular, step I4 typically includes scoring multiple character strings using a scoring system of the trained neural network to predict the character strings with the desired property. The scoring system generally ranks scored character strings. In addition, the scoring system typically accounts for a number of progeny character strings from each character string that includes a score above a selected score. For example, the number of progeny character strings typically includes, e.g., between about two and about 10⁵ progeny character strings. Generally, the scoring system combines each character string score with each corresponding progeny character string score to produce a final score. The final score provides a measure of a probability of the character strings mutating into progeny character strings that are improved relative to the character strings.

The artificial evolution procedures used in step I1 are optionally performed in silico and accordingly, typically include applying genetic operators to parental character strings to produce the population of artificially evolved character strings. Exemplary genetic operators optionally used in these methods include, e.g., a mutation of the at least one parental character string or substrings of the at least one parental character string, a multiplication of the at least one parental character string or substrings of the at least one parental character string, a fragmentation of the at least one parental character string into substrings, a crossover between parental character strings or substrings of the parental character strings, a ligation of parental character strings or substrings of the parental character strings, an elitism calculation, a calculation of sequence homology or sequence similarity of an alignment comprising parental character strings, a recursive use of at least one of the one or more genetic operators, an application of a randomness operator to the at least one parental character string or substrings of the at least one parental character string, a deletion mutation of one or more parental character strings or substrings of the one or more parental character strings, an insertion mutation into the at least one parental character string or substrings of the parental character string, a subtraction of parental character strings with inactive sequences, a selection of parental character strings with active sequences, a death of parental character strings or substrings of the parental character strings, or the like.
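Two of the genetic operators listed above, mutation and crossover, can be sketched in Python as follows. This is a minimal illustration rather than the procedure of the invention: the single-point crossover, the per-character mutation rate, the nucleotide alphabet, and the parental strings are all assumptions chosen for brevity.

```python
import random


def mutate(string, rate, alphabet, rng):
    """Mutation operator: redraw each character with probability `rate`.

    A redrawn character may by chance equal the original one.
    """
    return "".join(
        rng.choice(alphabet) if rng.random() < rate else ch for ch in string
    )


def crossover(a, b, rng):
    """Single-point crossover between two parental character strings."""
    point = rng.randrange(1, min(len(a), len(b)))
    return a[:point] + b[point:], b[:point] + a[point:]


def evolve(parents, n_offspring, rate, alphabet, seed=0):
    """Produce a population by recombining and then mutating parents."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    population = []
    while len(population) < n_offspring:
        a, b = rng.sample(parents, 2)
        for child in crossover(a, b, rng):
            population.append(mutate(child, rate, alphabet, rng))
    return population[:n_offspring]


# Hypothetical parental character strings of equal length.
parents = ["ACGTACGT", "TTGGCCAA", "ACGGTTAA"]
population = evolve(parents, n_offspring=10, rate=0.05, alphabet="ACGT")
print(population)
```

A selection or elitism step, also named in the list, would then keep only the members of `population` that pass a screen before the next round.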

The invention also provides a computer system for predicting character strings that include desired properties. The system includes (a) a computer system that includes a neural network and a database capable of storing character strings, and (b) system software. The system software includes one or more logic instructions for (i) evolving at least one parental character string using at least one artificial evolution procedure to produce at least one population of artificially evolved character strings, and (ii) selecting or screening the population of artificially evolved character strings for at least one desired property to produce a population of selected artificially evolved character strings. The software also includes instructions for (iii) training the neural network with the population of selected artificially evolved character strings to produce a trained neural network, and (iv) predicting one or more character strings that comprise the at least one desired property using the trained neural network.

In another aspect, the invention relates to a computer program product for predicting character strings that include desired properties. The computer program product includes a computer readable medium having one or more logic instructions for (a) evolving at least one parental character string using at least one artificial evolution procedure to produce at least one population of artificially evolved character strings, and (b) selecting or screening the population of artificially evolved character strings for at least one desired property to produce a population of selected artificially evolved character strings. The product also includes instructions for (c) training a neural network with the population of selected artificially evolved character strings to produce a trained neural network, and (d) predicting one or more character strings that comprise the at least one desired property using the trained neural network. Systems and software are described further herein.

B. Use of Pattern or Motif Finding Algorithms to Analyze Sequence Space

There are many computer programs available for searching for and finding motifs within a group of sequences. Typically, these programs are limited to characterizing sequences as part of a broad protein family or not. In the present invention, motif finding programs are used to characterize and predict the activity of proteins, e.g., artificially evolved proteins. For example, positive sequences (e.g., those having a desired level of fitness), negative sequences (e.g., those lacking a desired level of fitness), and parents are optionally entered into pattern finding programs separately. However, all types of sequences are optionally entered into the pattern finding program together, e.g., to increase the sensitivity to finding any patterns. Due to the generally higher homology of positive sequences, motif finding programs typically find many motifs or patterns that exist within each sequence group. Patterns are optionally scored according to a frequency of occurrence in each group, to a frequency of absence from each sequence group, and/or the like. Additionally, detected patterns are also optionally entered into another pattern recognition algorithm such as a neural network. Once pattern recognition and scoring are complete, hypothetical sequences are scored in order to find additional sequences that will or are more likely to have the desired activity/property. Further, PCA analysis is optionally performed on pattern finding results to determine if there are combinations of motifs or patterns that are predictive of activity, which are then used to score additional protein sequences. These methods are typically implemented in web- or other software-based embodiments, and optionally coupled with additional bioinformatics analysis tools, such as crossover analysis, shuffling analysis, oligo creation, structural analysis, etc. in order to sell molecular biology kits for shuffling, oligos, or other bioinformatics software or services.

In certain embodiments, search trees are generated, which are, e.g., based on a scoring method in order to organize patterns, or groups of patterns, in such a way as to permit traversing the tree instead of trying all possible patterns and combinations of patterns. For example, patterns are optionally scored by how often they show up in positive/negative sequences. Instead of individual patterns, PCA analysis or the like is optionally performed to determine combinations of patterns for each of the nodes. To illustrate, the results of searching patterns on the positive and negative sequences are optionally analyzed using PCA. A load cutoff value is typically used for each principal component and a resulting pattern (e.g., a list of patterns) would then correspond to the nodes of the tree.

In addition, patterns are optionally scored with a value that relates, e.g., to relative information content, importance, fitness etc. as well as a value of predicted activity. These are optionally used again to train neural networks or to build a decision tree to rank or score hypothetical proteins or other biopolymers. For example, if the pattern AAA.GAW is found to be the most important, then hypothetical proteins are typically checked first for that pattern and then on the basis of whether they have the next most important pattern in that sub-branch. This process is optionally continued with the next most important pattern given, e.g., that the first one was found or not found, and the sequence is classified on that basis. The “contains” and “does not contain” sub-trees may include similar nodes (i.e., patterns), or they may not, depending on how important a particular pattern is given its parent node lineage. To further illustrate, FIG. 14 schematically shows an example organizational tree. In the example, if a sequence has the three patterns AAA.GAW, AAA.G.W.W, and GPPW, then its probability of having the desired activity is 60%. This value might be based, e.g., on the fact that 60% of the positive sequences have these three patterns.
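The tree traversal described above can be sketched in Python. The sketch assumes that "." in a pattern matches any single residue, as it does in a regular expression. Only the 60% leaf for sequences containing all three patterns comes from the example in the text; the tree shape, the other leaf probabilities, and the test sequences are hypothetical placeholders.

```python
import re


def leaf(probability):
    """Terminal node holding a predicted probability of activity."""
    return {"probability": probability}


def node(pattern, contains, lacks):
    """Internal node: test `pattern`, then follow one of two sub-trees."""
    return {"pattern": pattern, "contains": contains, "lacks": lacks}


# Only the 0.60 leaf is taken from the text; every other probability is a
# hypothetical placeholder.
TREE = node(
    "AAA.GAW",
    contains=node(
        "AAA.G.W.W",
        contains=node("GPPW", contains=leaf(0.60), lacks=leaf(0.40)),
        lacks=leaf(0.30),
    ),
    lacks=leaf(0.10),
)


def classify(sequence, tree):
    """Walk the tree, testing one pattern per level."""
    current = tree
    path = []
    while "pattern" in current:
        found = re.search(current["pattern"], sequence) is not None
        path.append((current["pattern"], found))
        current = current["contains"] if found else current["lacks"]
    return current["probability"], path


probability, path = classify("MAAATGAWKAAACGTWSWLGPPWE", TREE)
print(probability, path)
```

Because each sequence follows a single root-to-leaf path, only three patterns are tested here rather than every pattern and pattern combination, which is the saving that the search tree is meant to provide.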

FIG. 15 is a chart that depicts certain steps performed in one embodiment of the methods of predicting properties of target polypeptide character strings (e.g., at least one hypothetical polypeptide character string, etc.). As shown, the methods include identifying one or more motifs common to two or more members of a population of polypeptide character string variants in which at least a subset of the population of polypeptide character string variants includes the at least one property (e.g., a functional property, a structural property, and/or the like), to produce a motif data set (J1). In certain embodiments, a phylogenetic family includes the polypeptide character string variants. At least one of the one or more motifs typically includes one or more character substrings. Typically, the at least one target polypeptide includes a population of target polypeptide character strings. In these embodiments, the population of target polypeptide character strings is generally produced by one or more artificial evolution procedures. The methods also include correlating at least one motif from the motif data set with the at least one property to produce a motif scoring function (J2), and scoring the at least one target polypeptide character string using the motif scoring function to predict the at least one property of the at least one target polypeptide character string (J3). At least one step of these methods is typically performed in a digital or web-based system. Optionally, the methods further include synthesizing a polypeptide corresponding to the target polypeptide character string. An additional option includes subjecting the polypeptide, or a polynucleotide that encodes the polypeptide, to one or more artificial evolution procedures.

Motif scoring functions are produced using various techniques. For example, step J2 optionally includes scoring the motifs or combinations of the motifs according to frequencies of occurrence in positive polypeptide character string variants or negative polypeptide character string variants to produce the motif scoring function. In some embodiments, step J2 includes scoring the motifs, or combinations of the motifs, with a value relating to relative information content and/or relative fitness. In other embodiments, step J2 includes scoring the motifs, or combinations of the motifs, with values relating to relative predictive activity. In still other embodiments, step J2 includes determining a number of times the one or more motifs occur in or are absent from the two or more members of the population of polypeptide character string variants.
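A frequency-based motif scoring function of the kind described for step J2, and its use in step J3, can be sketched in Python as follows. The weighting (frequency in positive variants minus frequency in negative variants) is one simple choice assumed here for illustration, and the motifs and sequences are hypothetical toy data; motifs are treated as regular expressions in which "." matches any residue.

```python
import re


def frequency(motif, sequences):
    """Fraction of sequences in which the motif occurs at least once."""
    if not sequences:
        return 0.0
    hits = sum(re.search(motif, s) is not None for s in sequences)
    return hits / len(sequences)


def build_scoring_function(motifs, positives, negatives):
    """J2: weight each motif by its frequency in positives minus negatives."""
    weights = {
        m: frequency(m, positives) - frequency(m, negatives) for m in motifs
    }

    def score(target):
        """J3: sum the weights of all motifs present in the target."""
        return sum(w for m, w in weights.items() if re.search(m, target))

    return score, weights


# Hypothetical toy data.
motifs = ["GPPW", "KK.L"]
positives = ["MGPPWAK", "AGPPWLL", "MKKGPPW"]
negatives = ["MKKALAA", "AKKGLVV"]

score, weights = build_scoring_function(motifs, positives, negatives)
print(weights)
print(score("AAGPPWAA"), score("KKALGPPW"), score("AAAA"))
```

A target carrying only the motif enriched in positive variants scores highest, while a target carrying both motifs scores zero because the two weights cancel in this toy data set.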

In certain embodiments, the population of polypeptide character string variants includes one or more polypeptide character string variant groups. Each polypeptide character string variant group optionally includes, e.g., positive polypeptide character string variants, negative polypeptide character string variants, and/or parental polypeptide character string variants. The polypeptide character string variants are typically produced by, or correspond to polypeptides produced by, one or more artificial evolution procedures. At least one (and typically more than one) step of the one or more artificial evolution techniques is generally performed in silico.

In preferred embodiments, at least step J1 is performed in at least one logic device that includes at least one first motif recognition algorithm, which first motif recognition algorithm identifies the one or more motifs. Typically, each method step is performed in the at least one logic device. Optionally, the methods further include producing at least one classification tree (e.g., at least one classification and regression tree (CART), etc.) to organize the motifs of the motif data set. For example, the at least one classification tree typically permits searching the motif data set without trying all of the motifs or combinations of motifs in the motif data set.

In some embodiments, the methods further include performing principal component analysis on the motif data set to identify one or more combinations of motifs that are predictive of the at least one desired property. Optionally, the methods further include performing a partial least squares analysis on the motif data set to identify one or more combinations of motifs that are predictive of the desired property. The one or more identified combinations of motifs are typically used to further refine the motif scoring function. In addition, the methods optionally further include producing at least one classification tree (e.g., at least one classification and regression tree, etc.) to organize the one or more combinations of motifs. In these embodiments, the one or more combinations of motifs typically include nodes in the at least one classification tree. Typically, the at least one classification tree permits searching the motif data set without trying all of the motifs or combinations of motifs in the motif data set. In certain other embodiments, the methods further include subjecting the motif data set to at least one second pattern recognition algorithm, which second pattern recognition algorithm identifies at least one additional motif common to at least two members of the population of polypeptide character string variants. For example, the second pattern recognition algorithm optionally includes a neural network. Neural networks are described further herein.
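As a rough illustration of principal component analysis applied to a motif data set, the sketch below computes the first principal component of a motif presence/absence matrix by power iteration. The matrix layout (rows are variants, columns are motifs), the function name, and the example data are assumptions made here; correlating the resulting component with the measured property is a further step that is not shown.

```python
def first_principal_component(rows, iterations=200):
    """Return the unit-length first principal component (motif loadings)
    of a variants-by-motifs matrix, using power iteration on the
    sample covariance matrix."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0 + 0.01 * j for j in range(d)]  # arbitrary non-zero starting vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:
            break  # no variance in the data; keep the current vector
        v = [x / norm for x in w]
    return v

# Hypothetical data: 1 means the motif occurs in the variant, 0 means it does not.
# Motifs 0 and 1 always occur together; motif 2 varies independently of them.
motif_matrix = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 0],
    [0, 0, 0],
    [1, 1, 1],
    [0, 0, 1],
]
loadings = first_principal_component(motif_matrix)
```

In this made-up example the first component loads equally on motifs 0 and 1 and negligibly on motif 2, flagging the co-occurring pair as a candidate combination of motifs.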

The invention also provides a system for predicting at least one property of at least one target polypeptide character string. The system includes (a) at least one computer that includes a database capable of storing character strings, and (b) system software. The system software includes one or more logic instructions for (i) identifying one or more motifs common to two or more members of a population of polypeptide character string variants, wherein at least a subset of the population of polypeptide character string variants comprises the at least one property, to produce a motif data set. The software also includes instructions for (ii) correlating at least one motif from the motif data set with the at least one property to produce a motif scoring function, and (iii) scoring the at least one target polypeptide character string using the motif scoring function to predict the at least one property of the at least one target polypeptide character string.

In addition, the invention also relates to a computer program product for predicting at least one property of at least one target polypeptide character string. The computer program product includes a computer readable medium having one or more logic instructions for (a) identifying one or more motifs common to two or more members of a population of polypeptide character string variants, wherein at least a subset of the population of polypeptide character string variants comprises the at least one property, to produce a motif data set. The computer program product also includes instructions for (b) correlating at least one motif from the motif data set with the at least one property to produce a motif scoring function, and (c) scoring the at least one target polypeptide character string using the motif scoring function to predict the at least one property of the at least one target polypeptide character string.

C. In Silico Directed Evolution with Functional Screening Using PCA and Neural Networks

In certain embodiments, at least one member of the set of parental character strings is obtained from at least one database. In some of these embodiments, the at least one member includes substantially all character strings available from the database. Typically, at least one member of the set of parental character strings is produced by, or corresponds to at least one polynucleotide or at least one polypeptide produced by, one or more artificial evolution procedures. At least one step of the artificial evolution procedures is typically performed in silico. In some embodiments, the set of parental character strings corresponds to a set of parental polynucleotides or polypeptides.

The invention also provides a system for assigning an activity to a character string. The system includes (a) at least one computer that includes a database capable of storing character strings, and (b) system software. The system software includes one or more logic instructions for (i) selecting a set of parental character strings for at least one activity to produce a set of selected parental character strings, and (ii) subjecting the set of selected parental character strings to one or more artificial evolution procedures to produce a set of evolved character strings. The system software also includes instructions for (iii) selecting the set of evolved character strings for the at least one activity to produce a set of selected evolved character strings, (iv) providing a sequence-activity plot for the set of character string variants, and (v) predicting at least one activity of one or more character strings from the sequence-activity plot.

In addition, the invention provides a computer program product for predicting character string activities. The computer program product includes a computer readable medium having one or more logic instructions for (a) selecting a set of parental character strings for at least one activity to produce a set of selected parental character strings, and (b) subjecting the set of selected parental character strings to one or more artificial evolution procedures to produce a set of evolved character strings. The product also includes instructions for (c) selecting the set of evolved character strings for the at least one activity to produce a set of selected evolved character strings, (d) providing a sequence-activity plot for the set of character string variants, and (e) predicting at least one activity of one or more character strings from the sequence-activity plot.

V. EXPERIMENTAL TECHNIQUES

A. Protein Variant Libraries

Libraries of protein variants can be generated using any of a variety of methods that are well known to those having ordinary skill in the art. These libraries are typically prepared by expression, either in vivo or in vitro, of a library of diverse polynucleotides. Libraries of diverse polynucleotides can be generated by application of a “diversity generating procedure” to one or more “parental” polynucleotides.

As used herein, the term “diversity generating procedure” refers to a method that modifies the sequence of a parental polynucleotide, and concomitantly the polypeptide it encodes, thereby generating a library of polynucleotide variants that differ from each other with respect to sequence. Diversity generating procedures that are suitable for use in the practice of the present invention include mutagenesis methods, recombination-based methods, or a combination of both. Expression of the resulting polynucleotide variant library thus generates a library of polypeptide variants.

Protein variant libraries employed in the practice of the present invention may be made in a “blind” fashion, where the protein variant molecules are generated without prior knowledge of their amino acid sequences (i.e., where the polynucleotide variant sequences are not known prior to expression into a protein variant library). Alternatively, the amino acid sequences of the protein variants may be designed a priori, followed by the step of actually making the physical molecules using methods known to those having ordinary skill in the art. These methods include expression of polynucleotides generated by, for example, gene synthesis via ligation and/or polymerase-mediated oligonucleotide assembly and mutagenesis of a parental polynucleotide, using methods known in the art. Suitable methods for designing amino acid sequences of systematically varied protein variants include design of experiment methods (DOE), described in more detail herein.

Polynucleotide mutagenesis is a suitable method for generating the protein variants employed in the practice of the present invention. Such methods include, for example, error-prone polymerase chain reaction (PCR), site-specific mutagenesis, cassette mutagenesis, in vivo mutagenesis methods, and the like. In error-prone PCR, PCR is performed under conditions where the copying fidelity of the DNA polymerase is low, such that a high rate of point mutations is obtained along the entire length of the PCR product. See e.g., Leung et al. (1989) Technique 1:11-15 and Caldwell et al. (1992) PCR Methods Applic. 2:28-33. Site-specific mutations can be introduced in a polynucleotide sequence of interest using oligonucleotide-directed mutagenesis. See Reidhaar-Olson et al. (1988) Science, 241:53-57. Similarly, cassette mutagenesis can be used in a process that replaces a small region of a double stranded DNA molecule with a synthetic oligonucleotide cassette that differs from the native sequence. In vivo mutagenesis can be used to generate random mutations in any cloned DNA of interest by propagating the DNA in a host cell strain prone to generating mutations, e.g., in a strain of E. coli that carries mutations in one or more of the DNA repair pathways. These “mutator” strains have a higher random mutation rate than that of a wild-type parent. Propagating the DNA in one of these strains will eventually generate random mutations within the DNA. Mutagenesis methods are generally well known to those having ordinary skill in the art and are extensively described elsewhere. See e.g., Kramer et al. (1984) Cell 38:879-887; Carter et al. (1985) Nucl. Acids Res. 13: 4431-4443; Carter (1987) Methods in Enzymol. 154: 382-403; Eghtedarzadeh & Henikoff (1986) Nucl. Acids Res. 14: 5115; Wells et al. (1986) Phil. Trans. R. Soc. Lond. A 317: 415-423; Nambiar et al. (1984) Science 223: 1299-1301; Sakamar and Khorana (1988) Nucl. Acids Res. 14: 6361-6372; Wells et al. (1985) Gene 34:315-323; Grundström et al. (1985) Nucl. Acids Res. 13: 3305-3316; Mandecki (1986) Proc. Natl. Acad. Sci. USA, 83:7177-7181; Arnold (1993) Current Opinion in Biotechnology 4:450-455; Anal Biochem. 254(2): 157-178; Dale et al. (1996) Methods Mol. Biol. 57:369-374; Smith (1985) Ann. Rev. Genet. 19:423-462; Botstein & Shortle (1985) Science 229:1193-1201; Carter (1986) Biochem. J. 237:1-7; Kunkel (1987) in Nucleic Acids & Molecular Biology, Eckstein, F. and Lilley, D. M. J. eds., Springer Verlag, Berlin; Kunkel (1985) Proc. Natl. Acad. Sci. USA 82:488-492; Kunkel et al. (1987) Methods in Enzymol. 154, 367-382; Bass et al. (1988) Science 242:240-245; Methods in Enzymol. 100: 468-500 (1983); Methods in Enzymol. 154: 329-350 (1987); Zoller & Smith (1982) Nucleic Acids Res. 10:6487-6500; Zoller & Smith (1983) Methods in Enzymol. 100:468-500; Zoller & Smith (1987) Methods in Enzymol. 154:329-350; Taylor et al. (1985) Nucl. Acids Res. 13: 8749-8764; Taylor et al. (1985) Nucl. Acids Res. 13: 8765-8787; Nakamaye & Eckstein (1986) Nucl. Acids Res. 14: 9679-9698; Sayers et al. (1988) Nucl. Acids Res. 16:791-802; Sayers et al. (1988) Nucl. Acids Res. 16: 803-814; Kramer et al. (1984) Nucl. Acids Res. 12: 9441-9456; Kramer & Fritz (1987) Methods in Enzymol. 154:350-367; Kramer et al. (1988) Nucl. Acids Res. 16: 7207; and Fritz et al. (1988) Nucl. Acids Res. 16: 6987-6999.
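The effect of low copying fidelity in error-prone PCR can be imitated on a character string. The following is a toy in silico sketch, not a model of any real polymerase: the uniform per-base error rate, the equal probability of each substitution, and the function name are assumptions made for illustration.

```python
import random

def error_prone_copy(seq, error_rate, rng):
    """Copy a DNA character string, substituting each base with a
    different base with probability error_rate (assumed uniform)."""
    bases = "ACGT"
    copied = []
    for base in seq:
        if rng.random() < error_rate:
            # Point mutation: pick one of the three other bases.
            copied.append(rng.choice([b for b in bases if b != base]))
        else:
            copied.append(base)
    return "".join(copied)

rng = random.Random(42)  # seeded so the sketch is repeatable
parent = "ATGAAAACCGCGTATATTGCGAAA"  # hypothetical parental sequence
library = [error_prone_copy(parent, 0.05, rng) for _ in range(10)]
```

Each member of `library` has the same length as the parent and differs from it only by point substitutions scattered along its entire length.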

Kits for mutagenesis, library construction and other diversity generation methods are commercially available. For example, kits are available from Stratagene (e.g., QuickChange™ site-directed mutagenesis kit; and Chameleon™ double-stranded, site-directed mutagenesis kit), Bio/Can Scientific, Bio-Rad (e.g., using the Kunkel method referenced above), Boehringer Mannheim Corp., Clonetech Laboratories, DNA Technologies, Epicentre Technologies (e.g., 5 prime 3 prime kit), Genpak Inc, Lemargo Inc, Life Technologies (Gibco BRL), New England Biolabs, Pharmacia Biotech, Promega Corp., Quantum Biotechnologies, Amersham International plc (e.g., using the Eckstein method referenced above), and Anglian Biotechnology Ltd. (e.g., using the Carter/Winter method referenced above).

Recombination-based methods are also suitable for generating a diverse library of polynucleotide variants that can be expressed to generate a protein variant library. These methods are also referred to as DNA shuffling. In these methods, polynucleotides are recombined, either in vitro or in vivo, to generate a library of polynucleotide variants. In recombination-based methods, DNA fragments, PCR amplicons, and/or synthetic oligonucleotides that collectively correspond in sequence to some or all of the sequence of one or more parental polynucleotides are recombined to generate a library of polynucleotide variants of the parental polynucleotide(s). The recombination process may be mediated by hybridization of the DNA fragments, PCR amplicons, and/or synthetic oligonucleotides to each other (e.g., as partially overlapping duplexes), or to a larger piece of DNA, such as a full-length template. Depending on the recombination format employed, ligase and/or polymerase may be used to facilitate the construction of a full-length polynucleotide. PCR cycling is typically used in formats employing only a polymerase. These methods are generally known to those having ordinary skill in the art and are described extensively elsewhere. See e.g., Soong, N. et al. (2000) Nat. Genet. 25(4):436-439; Stemmer, et al. (1999) Tumor Targeting 4:1-4; Ness et al. (1999) Nature Biotechnology 17:893-896; Chang et al. (1999) Nature Biotechnology 17:793-797; Minshull and Stemmer (1999) Current Opinion in Chemical Biology 3:284-290; Christians et al. (1999) Nature Biotechnology 17:259-264; Crameri et al. (1998) Nature 391:288-291; Crameri et al. (1997) Nature Biotechnology 15:436-438; Zhang et al. (1997) Proc. Natl. Acad. Sci. USA 94:4504-4509; Patten et al. (1997) Current Opinion in Biotechnology 8:724-733; Crameri et al. (1996) Nature Medicine 2:100-103; Crameri et al. (1996) Nature Biotechnology 14:315-319; Gates et al. (1996) Journal of Molecular Biology 255:373-386; Stemmer (1996) In: The Encyclopedia of Molecular Biology. VCH Publishers, New York. pp. 447-457; Crameri and Stemmer (1995) BioTechniques 18:194-195; Stemmer et al., (1995) Gene, 164:49-53; Stemmer (1995) “The Evolution of Molecular Computation” Science 270: 1510; Stemmer (1995) Bio/Technology 13:549-553; Stemmer (1994) Nature 370:389-391; Stemmer (1994) Proc. Natl. Acad. Sci. USA 91:10747-10751; Giver and Arnold (1998) Current Opinion in Chemical Biology 2:335-338; Zhao et al. (1998) Nature Biotechnology 16:258-261; Coco et al. (2001) Nature Biotechnology 19:354-359; U.S. Pat. Nos. 5,605,793, 5,811,238, 5,830,721, 5,834,252, and 5,837,458; and WO 95/22625, WO 96/33207, WO 97/20078, WO 97/35966, WO 99/41402, WO 99/41383, WO 99/41369, WO 99/41368, WO 99/23107, WO 99/21979, WO 98/31837, WO 98/27230, WO 00/00632, WO 00/09679, WO 98/42832, WO 99/29902, WO 98/41653, WO 98/41622, WO 98/42727, WO 00/18906, WO 00/04190, WO 00/42561, WO 00/42559, WO 00/42560, WO 01/23401, WO 00/20573, WO 01/29211, WO 00/46344, and WO 01/29212.

Parental polynucleotides employed in the recombination processes referenced above may be either wild-type polynucleotides or non-naturally occurring polynucleotides. In one embodiment of the present invention, protein variants having systematically varied sequences are prepared by recombination of two or more parental polynucleotides followed by expression. In some embodiments, the parental polynucleotides are members of a single gene family. As used herein, the term “gene family” refers to a set of genes that encode polypeptides which exhibit the same type, although not necessarily the same degree, of an activity.

Polynucleic acids can be recombined in vitro by any of a variety of techniques, including, e.g., DNAse digestion of nucleic acids to be recombined followed by ligation and/or PCR reassembly of the nucleic acids. For example, sexual PCR mutagenesis can be used in which random (or pseudo random, or even non-random) fragmentation of the DNA molecule is followed by recombination, based on sequence similarity, between DNA molecules with different but related DNA sequences, in vitro, followed by fixation of the crossover by extension in a polymerase chain reaction. This process and many process variants are described, e.g., in Stemmer (1994) Proc. Natl. Acad. Sci. USA 91:10747-10751.

Synthetic recombination methods can also be used, in which oligonucleotides corresponding to targets of interest are chemically synthesized and reassembled in PCR or ligation reactions which include oligonucleotides that correspond to more than one parental polynucleotide, thereby generating new recombined polynucleotides. Oligonucleotides can be made by standard nucleotide addition methods, or can be made, e.g., by tri-nucleotide synthetic approaches. Details regarding such approaches are found in the references noted above, e.g., WO 00/42561 by Crameri et al., “Oligonucleotide Mediated Nucleic Acid Recombination;” WO 01/23401 by Welch et al., “Use of Codon-Varied Oligonucleotide Synthesis for Synthetic Shuffling;” WO 00/42560 by Selifonov et al., “Methods for Making Character Strings, Polynucleotides and Polypeptides Having Desired Characteristics;” and WO 00/42559 by Selifonov and Stemmer “Methods of Populating Data Structures for Use in Evolutionary Simulations.”

Polynucleotides can also be recombined in vivo, e.g., by allowing recombination to occur between nucleic acids in cells. Many such in vivo recombination formats are set forth in the references noted above. Such formats optionally provide direct recombination between nucleic acids of interest, or provide recombination between vectors, viruses, plasmids, etc., comprising the nucleic acids of interest, as well as other formats. Details regarding such procedures are found in the references cited herein.

Many methods of accessing natural diversity, e.g., by hybridization of diverse nucleic acids or nucleic acid fragments to single-stranded templates, followed by polymerization and/or ligation to regenerate full-length sequences, optionally followed by degradation of the templates and recovery of the resulting modified nucleic acids, can be similarly used. These methods can be used in physical systems or can be performed in computer systems according to specific embodiments of the invention. In one method employing a single-stranded template, the fragment population derived from the genomic library(ies) is annealed with partial or, often, approximately full-length ssDNA or RNA corresponding to the opposite strand. Assembly of complex chimeric genes from this population is then mediated by nuclease-based removal of non-hybridizing fragment ends, polymerization to fill gaps between such fragments and subsequent single stranded ligation. The parental polynucleotide strand can be removed by digestion (e.g., if RNA or uracil-containing), magnetic separation under denaturing conditions (if labeled in a manner conducive to such separation) and other available separation/purification methods. Alternatively, the parental strand is optionally co-purified with the chimeric strands and removed during subsequent screening and processing steps. Additional details regarding this approach are found, e.g., in “Single-Stranded Nucleic Acid Template-Mediated Recombination and Nucleic Acid Fragment Isolation” by Affholter, WO 01/64864.

Methods of recombination can also be performed digitally on an information processing system. For example, algorithms can be used in a computer to recombine sequence strings that correspond to homologous (or even non-homologous) bio-molecules. According to specific embodiments of the invention, after processing in a computer system, the resulting sequence strings can be converted into nucleic acids by synthesis of nucleic acids which correspond to the recombined sequences, e.g., in concert with oligonucleotide synthesis/gene reassembly techniques. This approach can generate random, partially random, or designed variants. Many details regarding various embodiments of computer enabled recombination, including the use of various algorithms, operators and the like in computer systems, as well as combinations of designed nucleic acids and/or proteins (e.g., based on cross-over site selection) as well as designed, pseudo-random or random recombination methods are described in WO 00/42560 by Selifonov et al., “Methods for Making Character Strings, Polynucleotides and Polypeptides Having Desired Characteristics,” WO 01/75767 by Gustafsson et al., “In Silico Cross-Over Site Selection,” and WO 00/42559 by Selifonov and Stemmer “Methods of Populating Data Structures for Use in Evolutionary Simulations.”
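Digital recombination of sequence strings can be illustrated with a simple random cross-over operator. This sketch assumes pre-aligned parental strings of equal length and randomly placed cross-over points; it is not the algorithm of any of the cited references, and all names are hypothetical.

```python
import random

def recombine(parents, n_crossovers, rng):
    """Build one chimeric string from aligned, equal-length parents by
    switching to a different parent at each random cross-over point."""
    length = len(parents[0])
    if any(len(p) != length for p in parents):
        raise ValueError("parents must be aligned to the same length")
    # Cross-over points lie strictly inside the string, so no segment is empty.
    points = sorted(rng.sample(range(1, length), n_crossovers))
    current = rng.randrange(len(parents))
    segments, start = [], 0
    for end in points + [length]:
        segments.append(parents[current][start:end])
        start = end
        # Switch templates: choose any parent other than the current one.
        current = rng.choice([i for i in range(len(parents)) if i != current])
    return "".join(segments)

rng = random.Random(7)  # seeded so the sketch is repeatable
parents = ["AAAAAAAAAAAA", "CCCCCCCCCCCC", "GGGGGGGGGGGG"]  # hypothetical parents
chimeras = [recombine(parents, 3, rng) for _ in range(5)]
```

Every position of each chimera is inherited from one of the parents at that same position, which is what makes the resulting strings convertible back into nucleic acids by oligonucleotide synthesis and gene reassembly.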

B. Directed Evolution

Directed evolution (or alternatively “artificial evolution”) can be carried out by practicing one or more diversity generating methods in a reiterative fashion coupled with screening (described in more detail elsewhere herein) to generate a further set of recombinant nucleic acids. Thus, directed or artificial evolution can be carried out by repeated cycles of mutagenesis and/or recombination and screening. For example, mutagenesis and/or recombination can be carried out on parental polynucleotides to generate a library of variant polynucleotides that are then expressed to generate a protein variant library that is screened for a desired activity. One or more variant proteins may be identified from the protein variant library as exhibiting improvement in the desired activity. The identified proteins can be reverse translated to ascertain one or more polynucleotide sequences that encode the identified protein variants, which in turn can be mutated or recombined in a subsequent round of diversity generation and screening.
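The cycle of diversity generation, screening, and selection described above can be sketched in silico as a simple loop. Everything specific in this sketch is an assumption for illustration: the random point mutagenesis, the toy screen that counts matches to a made-up target string, and the choice to carry the current best variants into the next round.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard amino acids

def mutate(seq, rate, rng):
    """Replace each residue with a randomly drawn residue with probability rate
    (the draw may happen to return the original residue)."""
    return "".join(rng.choice(ALPHABET) if rng.random() < rate else aa for aa in seq)

def evolve(parents, screen, rounds, library_size, keep, rate, rng):
    """Repeat: generate a variant library, screen it, keep the top variants."""
    pool = list(parents)
    history = []
    for _ in range(rounds):
        library = [mutate(rng.choice(pool), rate, rng) for _ in range(library_size)]
        # The current pool is re-ranked with the library, so the best score
        # never decreases from one round to the next.
        candidates = set(library) | set(pool)
        ranked = sorted(candidates, key=lambda s: (-screen(s), s))
        pool = ranked[:keep]
        history.append(screen(pool[0]))
    return pool, history

TARGET = "MKTAYIAK"  # hypothetical optimum, used only by the toy screen

def toy_screen(seq):
    """Stand-in for an activity assay: number of positions matching TARGET."""
    return sum(a == b for a, b in zip(seq, TARGET))

rng = random.Random(0)  # seeded so the sketch is repeatable
pool, history = evolve(["MAAAAAAA"], toy_screen, rounds=5,
                       library_size=50, keep=5, rate=0.2, rng=rng)
```

Here `history` records the best screening score after each round; in a physical experiment the screen would be a measured activity rather than a string comparison.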

Directed evolution using recombination-based formats of diversity generation is described extensively in the references cited herein. Directed evolution using mutagenesis as the basis for diversity generation is also well known in the art. For example, recursive ensemble mutagenesis is a process in which an algorithm for protein mutagenesis is used to produce diverse populations of phenotypically related mutants, members of which differ in amino acid sequence. This method uses a feedback mechanism to monitor successive rounds of combinatorial cassette mutagenesis. Examples of this approach are found in Arkin & Youvan (1992) Proc. Natl. Acad. Sci. USA 89:7811-7815. Similarly, exponential ensemble mutagenesis can be used for generating combinatorial libraries with a high percentage of unique and functional mutants. Small groups of residues in a sequence of interest are randomized in parallel to identify, at each altered position, amino acids which lead to functional proteins. Examples of such procedures are found in Delegrave & Youvan (1993) Biotechnology Research 11:1548-1552.

Structure-activity models of the present invention are useful in optimizing the directed evolution process regardless of the diversity generating procedure employed. Information derived from application of the invention models can be used to more intelligently design libraries made in a directed evolution process. For example, where it is desired to toggle or fix residues at certain amino acid residue positions, synthetic oligonucleotides incorporating the codons encoding those desired amino acid residues can be used in one of the recombination formats referred to herein to generate a polynucleotide variant library that can then be expressed. Alternatively, the desired residues can be incorporated using one of the various mutagenesis methods described herein. In any event, the resulting protein variant library will thus contain protein variants that incorporate what are believed to be beneficial residues or potentially beneficial residues. This process can be repeated until a protein variant having the desired activity is identified.
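A library in which some residue positions are toggled and the rest are fixed can be enumerated directly at the character string level. The sketch below is an assumption-laden illustration: the parental string, the toggled positions, and the allowed residues are made up, and positions are 0-based.

```python
from itertools import product

def design_library(parent, choices):
    """Enumerate every variant of parent in which each position listed in
    choices takes one of its allowed residues; all other positions stay fixed.

    choices maps a 0-based position to a string of allowed residues.
    """
    options = [choices.get(i, aa) for i, aa in enumerate(parent)]
    return ["".join(combo) for combo in product(*options)]

# Hypothetical design: toggle T/S at position 2 and I/L/V at position 5.
library = design_library("MKTAYIAK", {2: "TS", 5: "ILV"})
```

The library size is the product of the number of residues allowed at each toggled position, here 2 × 3 = 6 variants, which gives a quick check on whether a proposed design fits the available screening capacity.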

C. Screening/Selection for Activity

Polynucleotides generated in connection with methods of the present invention are optionally cloned into cells for activity screening (or used in in vitro transcription reactions to make products which are screened). Furthermore, the nucleic acids can be enriched, sequenced, expressed, amplified in vitro or treated in any other common recombinant method.

General texts that describe molecular biological techniques useful herein, including cloning, mutagenesis, library construction, screening assays, cell culture and the like include Berger and Kimmel, Guide to Molecular Cloning Techniques, Methods in Enzymology volume 152 Academic Press, Inc., San Diego, Calif. (Berger); Sambrook et al., Molecular Cloning: A Laboratory Manual (2nd Ed.), Vol. 1-3, Cold Spring Harbor Laboratory, Cold Spring Harbor, N.Y., 1989 (Sambrook) and Current Protocols in Molecular Biology, F.M. Ausubel et al., eds., Current Protocols, a joint venture between Greene Publishing Associates, Inc. and John Wiley & Sons, Inc., New York (supplemented through 2000) (Ausubel). Methods of transducing cells, including plant and animal cells, with nucleic acids are generally available, as are methods of expressing proteins encoded by such nucleic acids. In addition to Berger, Ausubel and Sambrook, useful general references for culture of animal cells include Freshney (Culture of Animal Cells, a Manual of Basic Technique, third edition Wiley-Liss, New York (1994)) and the references cited therein, Humason (Animal Tissue Techniques, fourth edition W.H. Freeman and Company (1979)) and Ricciardelli, et al., In Vitro Cell Dev. Biol. 25:1016-1024 (1989). References for plant cell cloning, culture and regeneration include Payne et al. (1992) Plant Cell and Tissue Culture in Liquid Systems John Wiley & Sons, Inc. New York, N.Y. (Payne); and Gamborg and Phillips (eds) (1995) Plant Cell, Tissue and Organ Culture; Fundamental Methods Springer Lab Manual, Springer-Verlag (Berlin Heidelberg New York) (Gamborg). A variety of cell culture media are described in Atlas and Parks (eds) The Handbook of Microbiological Media (1993) CRC Press, Boca Raton, Fla. (Atlas). Additional information for plant cell culture is found in available commercial literature such as the Life Science Research Cell Culture Catalogue (1998) from Sigma-Aldrich, Inc (St Louis, Mo.) (Sigma-LSRCCC) and, e.g., the Plant Culture Catalogue and supplement (1997) also from Sigma-Aldrich, Inc (St Louis, Mo.) (Sigma-PCCS).

Examples of techniques sufficient to direct persons of skill through in vitro amplification methods (useful, e.g., for amplifying oligonucleotide recombined nucleic acids), including polymerase chain reactions (PCR), ligase chain reactions (LCR), Qβ-replicase amplifications and other RNA polymerase mediated techniques (e.g., NASBA), are found in Berger, Sambrook, and Ausubel, supra, as well as in Mullis et al., (1987) U.S. Pat. No. 4,683,202; PCR Protocols A Guide to Methods and Applications (Innis et al. eds) Academic Press Inc. San Diego, Calif. (1990) (Innis); Arnheim & Levinson (Oct. 1, 1990) C&EN 36-47; The Journal Of NIH Research (1991) 3, 81-94; Kwoh et al. (1989) Proc. Natl. Acad. Sci. USA 86, 1173; Guatelli et al. (1990) Proc. Natl. Acad. Sci. USA 87, 1874; Lomell et al. (1989) J. Clin. Chem 35, 1826; Landegren et al., (1988) Science 241, 1077-1080; Van Brunt (1990) Biotechnology 8, 291-294; Wu and Wallace, (1989) Gene 4, 560; Barringer et al. (1990) Gene 89, 117, and Sooknanan and Malek (1995) Biotechnology 13: 563-564. Improved methods of cloning in vitro amplified nucleic acids are described in Wallace et al., U.S. Pat. No. 5,426,039. Improved methods of amplifying large nucleic acids by PCR are summarized in Cheng et al. (1994) Nature 369: 684-685 and the references therein, in which PCR amplicons of up to 40 kb are generated. One of skill will appreciate that essentially any RNA can be converted into a double stranded DNA suitable for restriction digestion, PCR expansion and sequencing using reverse transcriptase and a polymerase. See, Ausubel, Sambrook and Berger, all supra.

In one preferred method, reassembled sequences are checked for incorporation of family-based recombination oligonucleotides. This can be done by cloning and sequencing the nucleic acids, and/or by restriction digestion, e.g., as essentially taught in Sambrook, Berger and Ausubel, supra. In addition, sequences can be PCR amplified and sequenced directly. Thus, in addition to, e.g., Sambrook, Berger, Ausubel and Innis (supra), additional PCR sequencing methodologies are also particularly useful. For example, direct sequencing of PCR generated amplicons by selectively incorporating boronated nuclease resistant nucleotides into the amplicons during PCR and digestion of the amplicons with a nuclease to produce sized template fragments has been performed (Porter et al. (1997) Nucleic Acids Research 25(8):1611-1617). In the methods, four PCR reactions on a template are performed, in each of which one of the nucleotide triphosphates in the PCR reaction mixture is partially substituted with a 2′deoxynucleoside 5′-[P-borano]-triphosphate. The boronated nucleotide is stochastically incorporated into PCR products at varying positions along the PCR amplicon in a nested set of PCR fragments of the template. An exonuclease that is blocked by incorporated boronated nucleotides is used to cleave the PCR amplicons. The cleaved amplicons are then separated by size using polyacrylamide gel electrophoresis, providing the sequence of the amplicon. An advantage of this method is that it uses fewer biochemical manipulations than performing standard Sanger-style sequencing of PCR amplicons.

Synthetic genes are amenable to conventional cloning and expression approaches; thus, properties of the genes and proteins they encode can readily be examined after their expression in a host cell. Synthetic genes can also be used to generate polypeptide products by in vitro (cell-free) transcription and translation. Polynucleotides and polypeptides can thus be examined for their ability to bind a variety of predetermined ligands, small molecules and ions, or polymeric and heteropolymeric substances, including other proteins and polypeptide epitopes, as well as microbial cell walls, viral particles, surfaces and membranes.

For example, many physical methods can be used for detecting polynucleotides encoding phenotypes associated with catalysis of chemical reactions by either polynucleotides directly, or by encoded polypeptides. Solely for the purpose of illustration, and depending on the specifics of particular pre-determined chemical reactions of interest, these methods may include a multitude of techniques well known in the art which account for a physical difference between substrate(s) and product(s), or for changes in the reaction media associated with chemical reaction (e.g., changes in electromagnetic emissions, adsorption, dissipation, and fluorescence, whether UV, visible or infrared (heat)). These methods also can be selected from any combination of the following: mass-spectrometry; nuclear magnetic resonance; isotopically labeled materials, partitioning and spectral methods accounting for isotope distribution or labeled product formation; spectral and chemical methods to detect accompanying changes in ion or elemental compositions of reaction product(s) (including changes in pH, inorganic and organic ions and the like). Other methods of physical assays, suitable for use in the methods herein, can be based on the use of biosensors specific for reaction product(s), including those comprising antibodies with reporter properties, or those based on in vivo affinity recognition coupled with expression and activity of a reporter gene. Enzyme-coupled assays for reaction product detection and cell life-death-growth selections in vivo can also be used where appropriate. Regardless of the specific nature of the physical assays, they all are used to select a desired activity, or combination of desired activities, provided or encoded by a biomolecule of interest.

The specific assay used for the selection will depend on the application. Many assays for proteins, receptors, ligands and the like are known. Formats include binding to immobilized components, cell or organismal viability, production of reporter compositions, and the like.

High throughput assays are particularly suitable for screening libraries employed in the present invention. In high throughput assays, it is possible to screen up to several thousand different variants in a single day. For example, each well of a microtiter plate can be used to run a separate assay, or, if concentration or incubation time effects are to be observed, every 5-10 wells can test a single variant (e.g., at different concentrations). Thus, a single standard microtiter plate can assay about 100 (e.g., 96) reactions. If 1536 well plates are used, then a single plate can easily assay from about 100 to about 1500 different reactions. It is possible to assay several different plates per day; assay screens of up to about 6,000-20,000 different assays (i.e., involving different nucleic acids, encoded proteins, concentrations, etc.) are possible using the integrated systems of the invention. More recently, microfluidic approaches to reagent manipulation have been developed, e.g., by Caliper Technologies (Mountain View, Calif.), which can provide very high throughput microfluidic assay methods.
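The plate arithmetic above can be sketched as follows. This is a minimal illustration only; the function names `variants_per_plate` and `plates_needed` are hypothetical and are not part of any screening system described herein.

```python
import math


def variants_per_plate(wells_per_plate: int, wells_per_variant: int = 1) -> int:
    """Number of distinct variants one plate can hold.

    wells_per_variant > 1 models the case where several wells test the
    same variant, e.g., at different concentrations.
    """
    if wells_per_plate <= 0 or wells_per_variant <= 0:
        raise ValueError("well counts must be positive")
    return wells_per_plate // wells_per_variant


def plates_needed(n_variants: int, wells_per_plate: int,
                  wells_per_variant: int = 1) -> int:
    """Smallest number of plates that holds n_variants variants."""
    per_plate = variants_per_plate(wells_per_plate, wells_per_variant)
    if per_plate == 0:
        raise ValueError("a single variant does not fit on one plate")
    return math.ceil(n_variants / per_plate)


if __name__ == "__main__":
    # One variant per well on a standard 96 well plate.
    print(variants_per_plate(96))
    # Eight wells (e.g., eight concentrations) per variant.
    print(variants_per_plate(96, wells_per_variant=8))
    # Plates required for a 6,000-assay screen in 96 well format.
    print(plates_needed(6000, 96))
```

Under these assumptions, a 96 well plate holds 12 variants when eight wells are devoted to each variant, and a screen of 6,000 single-well assays occupies 63 such plates.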

High throughput screening systems are commercially available (see, e.g., Zymark Corp., Hopkinton, Mass.; Air Technical Industries, Mentor, Ohio; Beckman Instruments, Inc. Fullerton, Calif.; Precision Systems, Inc., Natick, Mass., etc.). These systems typically automate entire procedures including all sample and reagent pipetting, liquid dispensing, timed incubations, and final readings of the microplate in detector(s) appropriate for the assay. These configurable systems provide high throughput and rapid start up as well as a high degree of flexibility and customization.

The manufacturers of such systems provide detailed protocols for various high throughput screening assays. Thus, for example, Zymark Corp. provides technical bulletins describing screening systems for detecting the modulation of gene transcription, ligand binding, and the like.

A variety of peripheral equipment and software is commercially available for digitizing, storing and analyzing digitized video, digitized optical or other assay images, e.g., using PC (Intel x86 or Pentium chip-compatible DOS™, OS2™, WINDOWS™, or WINDOWS NT™ based machines), MACINTOSH™, or UNIX based (e.g., SUN™ work station) computers.

Systems for analysis typically include a digital computer with software for directing one or more steps of one or more of the methods herein, and, optionally, also include, e.g., high-throughput liquid control software, image analysis software, data interpretation software, a robotic liquid control armature for transferring solutions from a source to a destination operably linked to the digital computer, an input device (e.g., a computer keyboard) for entering data to the digital computer to control operations or high throughput liquid transfer by the robotic liquid control armature and, optionally, an image scanner for digitizing label signals from labeled assay components. The image scanner can interface with image analysis software to provide a measurement of probe label intensity. Typically, the probe label intensity measurement is interpreted by the data interpretation software to show whether the labeled probe hybridizes to the DNA on the solid support.

Computational hardware and software resources are available that can be employed in the invention methods described herein (for hardware, any mid-range priced Unix system (e.g., from Sun Microsystems) or even higher end Macintosh or PCs will suffice).

In some embodiments, cells, viral plaques, spores or the like, comprising in vitro oligonucleotide-mediated recombination products or physical embodiments of in silico recombined nucleic acids, can be separated on solid media to produce individual colonies (or plaques). Using an automated colony picker (e.g., the Q-bot, Genetix, U.K.), colonies or plaques are identified, picked, and up to 10,000 different mutants inoculated into 96 well microtiter dishes containing two 3 mm glass balls/well. The Q-bot does not pick an entire colony but rather inserts a pin through the center of the colony and exits with a small sampling of cells (or mycelia) and spores (or viruses in plaque applications). The time the pin is in the colony, the number of dips to inoculate the culture medium, and the time the pin is in that medium each affect inoculum size, and each parameter can be controlled and optimized.

The uniform process of automated colony picking, such as with the Q-bot, decreases human handling error and increases the rate of establishing cultures (roughly 10,000/4 hours). These cultures are optionally shaken in a temperature and humidity controlled incubator. Optional glass balls in the microtiter plates act to promote uniform aeration of cells and the dispersal of cellular (e.g., mycelial) fragments, similar to the blades of a fermentor. Clones from cultures of interest can be isolated by limiting dilution. As also described supra, plaques or cells constituting libraries can also be screened directly for the production of proteins, either by detecting hybridization, protein activity, protein binding to antibodies, or the like. To increase the chances of identifying a pool of sufficient size, a prescreen that increases the number of mutants processed by 10-fold can be used. The goal of the primary screen is to quickly identify mutants having equal or better product titers than the parent strain(s) and to move only these mutants forward to liquid cell culture for subsequent analysis.

One approach to screening diverse libraries is to use a massively parallel solid-phase procedure to screen cells expressing polynucleotide variants, e.g., polynucleotides that encode enzyme variants. Massively parallel solid-phase screening apparatus using absorption, fluorescence, or FRET are available. See, e.g., U.S. Pat. No. 5,914,245 to Bylina, et al. (1999); see also, http://www.kairos-scientific.com/; Youvan et al. (1999) “Fluorescence Imaging Micro-Spectrophotometer (FIMS)” Biotechnology et alia, <www.et-al.com> 1:1-16; Yang et al. (1998) “High Resolution Imaging Microscope (HIRIM)” Biotechnology et alia, <www.et-al.com> 4:1-20; and Youvan et al. (1999) “Calibration of Fluorescence Resonance Energy Transfer in Microscopy Using Genetically Engineered GFP Derivatives on Nickel Chelating Beads” posted at www.kairos-scientific.com. Following screening by these techniques, molecules of interest are typically isolated, and optionally sequenced using methods that are well known in the art. The sequence information is then used as set forth herein to design a new protein variant library.

Similarly, a number of well-known robotic systems have also been developed for solution phase chemistries useful in assay systems. These systems include automated workstations like the automated synthesis apparatus developed by Takeda Chemical Industries, LTD. (Osaka, Japan) and many robotic systems utilizing robotic arms (Zymate II, Zymark Corporation, Hopkinton, Mass.; Orca, Beckman Coulter, Inc. (Fullerton, Calif.)) which mimic the manual synthetic operations performed by a scientist. Any of the above devices are suitable for use with the present invention, e.g., for high-throughput screening of molecules encoded by nucleic acids evolved as described herein. The nature and implementation of modifications to these devices (if any) so that they can operate as discussed herein will be apparent to persons skilled in the relevant art.

VII. DIGITAL APPARATUS AND SYSTEMS

As should be apparent, embodiments of the present invention employ processes acting under control of instructions and/or data stored in or transferred through one or more computer systems. Embodiments of the present invention also relate to apparatus for performing these operations. Such apparatus may be specially designed and/or constructed for the required purposes, or it may be a general-purpose computer selectively activated or reconfigured by a computer program and/or data structure stored in the computer. The processes presented herein are not inherently related to any particular computer or other apparatus. In particular, various general-purpose machines may be used with programs written in accordance with the teachings herein. In some cases, however, it may be more convenient to construct a specialized apparatus to perform the required method operations. A particular structure for a variety of these machines will appear from the description given below.

In addition, embodiments of the present invention relate to computer readable media or computer program products that include program instructions and/or data (including data structures) for performing various computer-implemented operations. Examples of computer-readable media include, but are not limited to, magnetic media such as hard disks, floppy disks, magnetic tape; optical media such as CD-ROM devices and holographic devices; magneto-optical media; semiconductor memory devices, and hardware devices that are specially configured to store and perform program instructions, such as read-only memory devices (ROM) and random access memory (RAM), and sometimes application-specific integrated circuits (ASICs), programmable logic devices (PLDs) and signal transmission media for delivering computer-readable instructions, such as local area networks, wide area networks, and the Internet. The data and program instructions of this invention may also be embodied on a carrier wave or other transport medium (e.g., optical lines, electrical lines, and/or airwaves).

Examples of program instructions include both low-level code, such as produced by a compiler, and files containing higher level code that may be executed by the computer using an interpreter. Further, the program instructions include machine code, source code and any other code that directly or indirectly controls operation of a computing machine in accordance with this invention. The code may specify input, output, calculations, conditionals, branches, iterative loops, etc.

Standard desktop applications such as word processing software (e.g., Microsoft Word™ or Corel WordPerfect™) and database software (e.g., spreadsheet software such as Microsoft Excel™, Corel Quattro Pro™, or database programs such as Microsoft Access™ or Paradox™) can be adapted to the present invention by inputting one or more character strings into the software which is loaded into the memory of a digital system, and performing an operation as noted herein on the character string. For example, systems can include the foregoing software having the appropriate character string information, e.g., used in conjunction with a user interface (e.g., a GUI in a standard operating system such as a Windows, Macintosh or LINUX system) to manipulate strings of characters. Specialized alignment programs such as PILEUP and BLAST can also be incorporated into the systems of the invention, e.g., for alignment of nucleic acids or proteins (or corresponding character strings) as a preparatory step to performing an operation on any aligned sequences. Software for performing PCA (e.g., as is commercially available from Partek) or other statistical operations can also be included in the digital system.

Systems typically include, e.g., a digital computer with software for aligning and manipulating sequences according to the operations noted herein, or for performing PCA, neural network analysis or the like, as well as data sets entered into the software system comprising sequences or other data to be mapped or manipulated. The computer can be, e.g., a PC (Intel x86 or Pentium chip-compatible DOS™, OS2™, WINDOWS™, WINDOWS NT™, WINDOWS95™, WINDOWS98™, LINUX, Apple-compatible, MACINTOSH™ compatible, Power PC compatible, or a UNIX compatible (e.g., SUN™ work station or machine)) or other common commercially available computer which is known to one of skill. Software for aligning or otherwise manipulating sequences can be constructed by one of skill using a standard programming language such as VisualBasic, Fortran, Basic, Java, or the like, according to the methods herein.

Any controller or computer optionally includes a monitor which can include, e.g., a cathode ray tube (“CRT”) display, a flat panel display (e.g., active matrix liquid crystal display, liquid crystal display), or others. Computer circuitry is often placed in a box which includes numerous integrated circuit chips, such as a microprocessor, memory, interface circuits, and others. The box also optionally includes a hard disk drive, a floppy disk drive, a high capacity removable drive such as a writeable CD-ROM, and other common peripheral elements. Inputting devices such as a keyboard or mouse optionally provide for input from a user and for user selection of sequences to be compared or otherwise manipulated in the relevant computer system.

The computer typically includes appropriate software for receiving user instructions, either in the form of user input into a set of parameter fields, e.g., in a GUI, or in the form of preprogrammed instructions, e.g., preprogrammed for a variety of different specific operations. The software then converts these instructions to appropriate language for instructing the system to carry out any desired operation. For example, in addition to performing statistical manipulations of data space, a digital system can instruct an oligonucleotide synthesizer to synthesize oligonucleotides for gene reconstruction, or even to order oligonucleotides from commercial sources (e.g., by printing appropriate order forms or by linking to an order form on the internet).

The digital system can also include output elements for controlling nucleic acid synthesis (e.g., based upon a sequence or an alignment of sequences herein), i.e., an integrated system of the invention optionally includes an oligonucleotide synthesizer or an oligonucleotide synthesis controller. The system can include other operations which occur downstream from an alignment or other operation performed using a character string corresponding to a sequence herein, e.g., as noted above with reference to assays.

In one example, code embodying methods of the invention is embodied in a fixed media or transmissible program component containing logic instructions and/or data that when loaded into an appropriately configured computing device causes the device to perform a genetic operator on one or more character strings. FIG. 16 shows an example digital device 2200 that should be understood to be a logical apparatus that can read instructions from media 2217, network port 2219, user input keyboard 2209, user input 2211 or other inputting means. Apparatus 2200 can thereafter use those instructions to direct statistical operations in data space, e.g., to construct one or more data sets (e.g., to determine a plurality of representative members of the data space). One type of logical apparatus that can embody the invention is a computer system as in computer system 2200 comprising CPU 2207, optional user input devices keyboard 2209, and GUI pointing device 2211, as well as peripheral components such as disk drives 2215 and monitor 2205 (which displays GO modified character strings and provides for simplified selection of subsets of such character strings by a user). Fixed media 2217 is optionally used to program the overall system and can include, e.g., a disk-type optical or magnetic media or other electronic memory storage element. Communication port 2219 can be used to program the system and can represent any type of communication connection.

The invention can also be embodied within the circuitry of an application specific integrated circuit (ASIC) or programmable logic device (PLD). In such a case, the invention is embodied in a computer readable descriptor language that can be used to create an ASIC or PLD. The invention can also be embodied within the circuitry or logic processors of a variety of other digital apparatus, such as PDAs, laptop computer systems, displays, image editing equipment, etc.

In one preferred aspect, the digital system comprises a learning component where the outcomes of physical oligonucleotide assembly schemes (compositions, abundance of products, different processes) are monitored in conjunction with physical assays, and correlations are established. Successful and unsuccessful combinations are documented in a database to provide justification/preferences for user-based or digital system based selection of sets of parameters for subsequent processes described herein involving the same set of parental character strings/nucleic acids/proteins (or even unrelated sequences, where the information provides process improvement information). The correlations are used to modify subsequent processes of the invention, e.g., to optimize the particular process. This cycle of physical synthesis, selection and correlation is optionally repeated to optimize the system. For example, a learning neural network can be used to optimize outcomes.

VIII. EMBODIMENTS IN WEBSITES

The Internet includes computers, information appliances, and computer networks that are interconnected through communication links. The interconnected computers exchange information using various services, such as electronic mail, ftp, the World Wide Web (“WWW”) and other services, including secure services. The WWW service can be understood as allowing a server computer system (e.g., a Web server or a Web site) to send web pages of information to a remote client information appliance or computer system. The remote client computer system can then display the web pages. Generally, each resource (e.g., computer or web page) of the WWW is uniquely identifiable by a Uniform Resource Locator (“URL”). To view or interact with a specific web page, a client computer system specifies a URL for that web page in a request. The request is forwarded to a server that supports that web page. When the server receives the request, it sends that web page to the client information system. When the client computer system receives that web page, it can display the web page using a browser or can interact with the web page or interface as otherwise provided. A browser is a logic module that effects the requesting of web pages and displaying or interacting with web pages.

Currently, displayable web pages are typically defined using a Hyper Text Markup Language (“HTML”). HTML provides a standard set of tags that define how a web page is to be displayed. An HTML document contains various tags that control the displaying of text, graphics, controls, and other features. The HTML document may contain URLs of other Web pages available on that server computer system or other server computer systems. URLs can also indicate other types of interfaces, including such things as CGI scripts or executable interfaces, that information appliances use to communicate with remote information appliances or servers without necessarily displaying information to a user.

The Internet is especially conducive to providing information services to one or more remote customers. Services can include items (e.g., music or stock quotes) that are delivered electronically to a purchaser over the Internet. Services can also include handling orders for items (e.g., groceries, books, or chemical or biologic compounds, etc.) that may be delivered through conventional distribution channels (e.g., a common carrier). Services may also include handling orders for items, such as airline or theater reservations, that a purchaser accesses at a later time. A server computer system may provide an electronic version of an interface that lists items or services that are available. A user or a potential purchaser may access the interface using a browser and select various items of interest. When the user has completed selecting the items desired, the server computer system may then prompt the user for information needed to complete the service. This transaction-specific order information may include the purchaser's name or other identification, an identification for payment (such as a corporate purchase order number or account number), or additional information needed to complete the service, such as flight information.

NCBI Databases and Software

Among services of particular interest that can be provided over the internet and over other networks are biological data and biological databases. Such services include a variety of services provided by the National Center for Biotechnology Information (NCBI) of the National Institutes of Health (NIH). NCBI is charged with creating automated systems for storing and analyzing knowledge about molecular biology, biochemistry, and genetics; facilitating the use of such databases and software by the research and medical community; coordinating efforts to gather biotechnology information both nationally and internationally; and performing research into advanced methods of computer-based information processing for analyzing the structure and function of biologically important molecules.

NCBI holds responsibility for the GenBank® DNA sequence database. The database has been constructed from sequences submitted by individual laboratories and by data exchange with the international nucleotide sequence databases, the European Molecular Biology Laboratory (EMBL) and the DNA Database of Japan (DDBJ), and includes patent sequence data submitted to the U.S. Patent and Trademark Office. In addition to GenBank®, NCBI supports and distributes a variety of databases for the medical and scientific communities. These include the Online Mendelian Inheritance in Man (OMIM), the Molecular Modeling Database (MMDB) of 3D protein structures, the Unique Human Gene Sequence Collection (UniGene), a Gene Map of the Human Genome, the Taxonomy Browser, and the Cancer Genome Anatomy Project (CGAP), in collaboration with the National Cancer Institute. Entrez is NCBI's search and retrieval system that provides users with integrated access to sequence, mapping, taxonomy, and structural data. Entrez also provides graphical views of sequences and chromosome maps. A feature of Entrez is the ability to retrieve related sequences, structures, and references. BLAST, as described herein, is a program for sequence similarity searching developed at NCBI for identifying genes and genetic features that can execute sequence searches against the entire DNA database. Additional software tools provided by NCBI include: Open Reading Frame Finder (ORF Finder), Electronic PCR, and the sequence submission tools, Sequin and BankIt. NCBI's various databases and software tools are available from the WWW or by FTP or by e-mail servers. Further information is available at www.ncbi.nlm.nih.gov.

Some biological data available over the internet is data that is generally viewed with a special browser “plug-in” or other executable code. One example of such a system is CHIME, a browser plug-in that allows an interactive virtual 3-dimensional display of molecular structures, including biological molecular structures. Further information regarding CHIME is available at www.mdlchime.com/chime/.

Online Oligos, Gene, or Protein Ordering

A variety of companies and institutions provide online systems for ordering biological compounds. Examples of such systems can be found at www.genosys.com/oligo_custinfo.cfm or www.genomictechnologies.com/Qbrowser2_FP.html. Typically, these systems accept some descriptor of a desired biological compound (such as an oligonucleotide, DNA strand, RNA strand, amino acid sequence, etc.) and then the requested compound is manufactured and is shipped to the customer in a liquid solution or other appropriate form.

To further illustrate, the methods of this invention can be implemented in a localized or distributed computing environment. In a distributed environment, the methods may be implemented on a single computer comprising multiple processors or on a multiplicity of computers. The computers can be linked, e.g. through a common bus, but more preferably the computer(s) are nodes on a network. The network can be a generalized or a dedicated local or wide-area network and, in certain preferred embodiments, the computers may be components of an Intranet or an Internet.

In one internet embodiment, a client system typically executes a Web browser and is coupled to a server computer executing a Web server. The Web browser is typically a program such as IBM's Web Explorer, Microsoft's Internet Explorer, Netscape, Opera, or Mosaic. The Web server is typically, but not necessarily, a program such as IBM's HTTP Daemon or other www daemon (e.g., LINUX-based forms of the program). The client computer is bi-directionally coupled with the server computer over a line or via a wireless system. In turn, the server computer is bi-directionally coupled with a website (server hosting the website) providing access to software implementing the methods of this invention.

As mentioned, a user of a client connected to the Intranet or Internet may cause the client to request resources that are part of the web site(s) hosting the application(s) providing an implementation of the methods of this invention. Server program(s) then process the request to return the specified resources (assuming they are currently available). The standard naming convention (i.e., Uniform Resource Locator (“URL”)) encompasses several types of location names, presently including subclasses such as Hypertext Transfer Protocol (“http”), File Transfer Protocol (“ftp”), gopher, and Wide Area Information Service (“WAIS”). When a resource is downloaded, it may include the URLs of additional resources. Thus, the user of the client can easily learn of the existence of new resources that he or she had not specifically requested.

The software implementing the method(s) of this invention can run locally on the server hosting the website in a true client-server architecture. Thus, the client computer posts requests to the host server which runs the requested process(es) locally and then downloads the results back to the client. Alternatively, the methods of this invention can be implemented in a “multi-tier” format in which a component of the method(s) is performed locally by the client. This can be implemented by software downloaded from the server on request by the client (e.g. a Java application) or it can be implemented by software “permanently” installed on the client.

In one embodiment the application(s) implementing the methods of this invention are divided into frames. In this paradigm, it is helpful to view an application not so much as a collection of features or functionality but, instead, as a collection of discrete frames or views. A typical application, for instance, generally includes a set of menu items, each of which invokes a particular frame, that is, a form which manifests certain functionality of the application. With this perspective, an application is viewed not as a monolithic body of code but as a collection of applets, or bundles of functionality. In this manner, from within a browser, a user would select a Web page link which would, in turn, invoke a particular frame of the application (i.e., a sub-application). Thus, for example, one or more frames may provide functionality for inputting and/or encoding biological molecule(s) into one or more data spaces, while another frame provides tools for refining a model of the data space.

In certain embodiments, the methods of this invention are implemented as one or more frames providing, e.g., the following functionalit(ies): function(s) to encode two or more biological molecules into character strings to provide a collection of two or more different initial character strings wherein each of said biological molecules comprises a selected set of subunits; functions to select at least two substrings from the character strings; functions to concatenate the substrings to form one or more product strings about the same length as one or more of the initial character strings; functions to add (place) the product strings to a collection of strings; and functions to implement any feature set forth herein.
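The encode, select, concatenate and add functions listed above can be sketched as follows. This is an illustrative sketch only, assuming pre-aligned parental strings of equal length and crossover points shared by all parents; the function names are hypothetical and do not correspond to any particular implementation of the invention.

```python
import random


def encode(molecule: str) -> str:
    """Encode a biological molecule as an upper-case character string."""
    return "".join(molecule.split()).upper()


def select_substrings(strings, n_segments, rng):
    """Cut the strings at shared random points and take each segment
    from a randomly chosen parental string."""
    length = len(strings[0])
    if any(len(s) != length for s in strings):
        raise ValueError("strings must have equal length (pre-aligned)")
    if not 1 <= n_segments <= length:
        raise ValueError("n_segments must be between 1 and the string length")
    cuts = sorted(rng.sample(range(1, length), n_segments - 1))
    bounds = [0, *cuts, length]
    return [rng.choice(strings)[a:b] for a, b in zip(bounds, bounds[1:])]


def concatenate(substrings) -> str:
    """Join substrings, in order, into one product string."""
    return "".join(substrings)


def add_to_collection(collection, products):
    """Add product strings to the collection, skipping duplicates."""
    for product in products:
        if product not in collection:
            collection.append(product)
    return collection


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    parents = [encode("atg gct aaa"), encode("atg tct cgt")]
    products = [concatenate(select_substrings(parents, 3, rng))
                for _ in range(5)]
    print(add_to_collection(list(parents), products))
```

Because every segment boundary is shared by all parents, each product string here has exactly the length of the initial strings; a format that tolerates products of only "about the same length" would relax that constraint.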

The functions to distribute two or more biological molecules into data space can provide one or more windows wherein the user can insert representation(s) of biological molecules. In addition, the encoding function also, optionally, provides access to private and/or public databases accessible through a local network and/or the Internet whereby one or more sequences contained in the databases can be input into the methods of this invention. Thus, for example, in one embodiment, where the end user inputs a nucleic acid sequence into the encoding function, the user can, optionally, have the ability to request a search of GenBank® and input one or more of the sequences returned by such a search into the encoding and/or diversity generating function.

Methods of implementing Intranet and/or Internet embodiments of computational and/or data access processes are well known to those of skill in the art and are documented in great detail (see, e.g., Cluer et al. (1992) “A General Framework for the Optimization of Object-Oriented Queries,” Proc SIGMOD International Conference on Management of Data, San Diego, Calif., Jun. 2-5, 1992, SIGMOD Record, vol. 21, Issue 2, June, 1992; Stonebraker, M., Editor; ACM Press, pp. 383-392; ISO-ANSI, Working Draft, “Information Technology-Database Language SQL,” Jim Melton, Editor, International Organization for Standardization and American National Standards Institute, July 1992; Microsoft Corporation, “ODBC 2.0 Programmer's Reference and SDK Guide. The Microsoft Open Database Standard for Microsoft Windows™ and Windows NT™, Microsoft Open Database Connectivity™ Software Development Kit,” 1992, 1993, 1994 Microsoft Press, pp. 3-30 and 41-56; ISO Working Draft, “Database Language SQL-Part 2: Foundation (SQL/Foundation),” CD9075-2:199x SQL, Sep. 11, 1997, and the like). Additional relevant details regarding web-based applications are found in WO 00/42559, entitled “METHODS OF POPULATING DATA STRUCTURES FOR USE IN EVOLUTIONARY SIMULATIONS,” by Selifonov and Stemmer.

IX. EXAMPLES

Identifying Functional Constraints in Proteins by Synthetic DNA Shuffling

The following non-limiting example is offered only by way of illustration.

Protein evolution is manifested by amino acid changes in the coding sequence. These amino acid changes are constrained by continuous selective pressure for function, resulting in independent and correlated changes in a protein's descendants. This section presents a method for differentiating covariation between amino acids reflecting functional selection from covariation that simply results from a common ancestral origin.

Functional screening and sequencing suggest that most of the covariation observed in naturally occurring sequences results from phylogenetic descent, rather than functional constraints. The functional covariations that are identified are mainly in local structural elements, but there is also some covariation occurring over longer distances in genes/proteins. In general, genes and proteins are very plastic and have evolved to minimize the interdependence of allowed amino acid changes to facilitate adaptation.

During divergent evolution, protein sequences change while the biochemical function of the protein is generally retained. Correlated changes between functionally linked residues in a protein provide for the preservation of protein structure and function throughout the evolutionary process. The functional link between the covarying residues can be due, e.g., to structural contact or an indirect effect through interactions with substrates, products, cofactors or other proteins. Independent mutations among functionally linked residues are often disadvantageous, but two simultaneous mutations may allow the protein to retain function. Alternatively, two or more residues may covary simply due to a common ancestral origin. Current analytical tools are limited in the ability to separate the functional from the phylogenetic (ancestral) covariation in a family of orthologous proteins. Statistical tools are limited both by the amount of data available to infer covariation and by the evolutionary models used to explain the data. See, Wollenberg, K. R. & Atchley, W. R. Separation of phylogenetic and functional associations in biological sequences by using the parametric bootstrap. Proc. Nat'l Acad. Sci 97, 3288-91 (2000); Gaucher, E. A., Miyamoto, M. M. & Benner, S. A. Function-structure analysis of proteins using covarion-based evolutionary approaches: Elongation factors. Proc. Nat'l Acad. Sci 98, 548-552 (2001); Larson, S. M., Di Nardo, A. A. & Davidson, A. R. Analysis of covariation in an SH3 domain sequence alignment: applications in tertiary contact prediction and the design of compensating hydrophobic core substitutions. J Mol Biol 303, 433-46 (2000); Pollock, D. D., Taylor, W. R. & Goldman, N. Coevolving protein residues: maximum likelihood identification and relationship to structure. J Mol Biol 287, 187-98 (1999); and Atchley, W. R., Wollenberg, K. R., Fitch, W. M., Terhalle, W. & Dress, A. W. Correlations among amino acid sites in bHLH protein domains: an information theoretic analysis. Mol Biol Evol 17, 164-78 (2000).

If sequential point mutations are the primary mechanism for divergent evolution, most amino acid changes should occur independently: two simultaneous mutations will be extremely rare (e.g., at the rate of one mutation per 10⁹ base pairs for a single cell division in E. coli).

Here an experiment is described in which all amino acids in a family of proteins are deliberately uncoupled by synthetic DNA shuffling (i.e., recombination of synthetic oligonucleotides that collectively correspond in sequence to a set of parental polynucleotides). By allowing all residues to vary independent of context and then screening for function, any covariation derived from common ancestral origin is eliminated and only covariation that contributes to function is retained. Functional variants are analyzed using mutual information theory to assess covariation between residues. Most of the covariation observed among the parental sequences is not preserved in functional chimeric proteins, indicating that it is primarily a measure of common ancestral descent. The methods also identify covarying residues that are not seen among the parents due to sampling effects.

Synthetic shuffling can be performed in a homology-independent manner that gives each allowed residue at any given position an essentially equal probability of being incorporated into the final product. See, e.g., WO 00/42561 by Crameri et al., “Oligonucleotide Mediated Nucleic Acid Recombination” and Ness, J., Minshull, J. & Kim, S. Synthetic Shuffling. Nature Biotech Submitted (2001). This is in contrast to many other recombination formats where the distribution of any single residue is dependent on its abundance and context among the parental genes. Synthetic shuffling results in a library of sequences that are completely chimeric on the single residue level and rich in natural diversity.

Despite the vast total size of libraries which can be generated by synthetic shuffling, characterization of only a small subset of the library is sufficient to test a significant number of covarying residue pairs for correlation with function. Any pair of covarying amino acid residues is sampled many times over among the fully characterized variants. Libraries generated through synthetic shuffling are an excellent unbiased source of data to analyze the relative importance of covariance and its distribution in a biological system.
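As a rough illustration of why a small characterized subset can sample each residue pair many times, consider a purely hypothetical case (the numbers are assumptions, not data from this disclosure): two variable positions i and j, each with two allowed residues incorporated independently and with equal probability, and N = 96 characterized variants. The expected number of variants carrying a particular residue a at position i together with a particular residue b at position j is then:

$\begin{matrix}{E\left\lbrack n_{ab} \right\rbrack = N \cdot P\left( a_{i} \right) \cdot P\left( b_{j} \right) = 96 \cdot \frac{1}{2} \cdot \frac{1}{2} = 24}\end{matrix}$

Under these assumptions, each of the four possible residue combinations at the two positions is expected to be observed about 24 times, independent of the total size of the library from which the variants were drawn.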

Characterizing the distribution of a pre-screened library allows one to normalize the covariation found among the active variants to the inherent distribution of covariance in the library. Any spurious, artifactual mutual information derived from an imperfect library (for example, oligonucleotide degeneracy biases produced during synthesis) can be eliminated. In general, there is no, or very little, difference in the sequence diversity distribution between pre-screened and active variants. In both cases, the variants are evenly distributed, suggesting no significant bias towards diversity originating from any given parent or cluster of parents. This shows that new regions of sequence space can be explored for functional activity by distributing the characterized variants evenly across the same sequence space covered by parental genes. Sequence distance traversed using classic directed evolution techniques such as random mutagenesis is usually limited to 1-3 amino acid residues per gene per round. Most of the solutions found through synthetic shuffling are consequently inaccessible by random mutagenesis.

Covariation between residues inferred from biological sequence data can be attributed to either functional constraints or phylogenetic relationships. Since one generally does not know the historical origin of the sequences at issue (at least where the sequences are naturally occurring), one cannot de-convolute the covariant nature of residues involved. This issue has typically been addressed either through collecting as many sequences as possible under a given node in a phylogenetic tree, or by computer simulations of possible evolutionary paths using a model for sequence evolution. Both approaches have significant complications and drawbacks. An inherent complication of the first type of covariation analysis is the inclusion of sequences having diverged not only in neutral mutations, but also in function. The divergence can be small, as in evolving to a slightly different pH optimum, or large, as in evolving to catalyze a related but different reaction. No single orthologous enzyme pair has truly evolved for the exact same physiological conditions. Including sequences in the covariation analysis that have diverged in function adds noise to the correlations, as they are subjected to different selective pressures. Another, perhaps more serious, concern is the inability to ever gather all sequences under a phylogenetic node to ensure that the distribution in the data set is unbiased by sampling effects. In a library produced by synthetic shuffling, all inherent covariation is removed and amino acid diversity occurring in any one position has an equal probability of occurring in any variant. Screening such a library (e.g., in vitro) for a defined biochemical function identifies all covariation derived from functional constraints required for the assayed biological activity of the enzyme. The remainder of the covariation found among the parental genes, but not present among the functional progeny, is consequently the result of common ancestral origin.

The covariation among a set of variants from the library can be assessed and visualized by aligning the sequences and removing residues that are conserved throughout the alignment. The mutual information between each varying residue pair is plotted in a two-dimensional matrix. Each row/column represents one of the varying residue positions for a protein and each cell in the matrix represents a possible residue pair. A filled cell of the matrix corresponds to highly covarying residues. Each parental sequence has evolved independently through natural selection and their phylogenetic distribution is highly clustered. Displaying every residue pair for the parental genes identifies many residue pairs that covary. The mutual information distribution is normalized to have a mean of 0 and variance of 1. Covariation here is defined as residue pairs with mutual information more than 2 standard deviations above the mean for that alignment.
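The first step described above, removing positions that are conserved throughout the alignment, can be sketched as follows. This is a minimal illustration only; the function names and the toy alignment are hypothetical and not part of the disclosure, and the sequences are assumed to be pre-aligned strings of equal length.

```python
def variable_positions(alignment):
    """Return indices of alignment columns containing more than one residue."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return [i for i in range(length)
            if len({seq[i] for seq in alignment}) > 1]


def strip_conserved(alignment):
    """Drop conserved columns; return (kept column indices, reduced sequences)."""
    keep = variable_positions(alignment)
    reduced = ["".join(seq[i] for i in keep) for seq in alignment]
    return keep, reduced


# Toy alignment (hypothetical): columns 0 and 2 are conserved.
alignment = ["MKVLA", "MRVLA", "MKVIA", "MRVIG"]
positions, reduced = strip_conserved(alignment)
print(positions)  # indices of the varying columns
print(reduced)    # sequences restricted to those columns
```

The kept indices then label the rows and columns of the two-dimensional matrix, with one cell per pair of varying positions.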

After making the synthetic library, but before exposing the variants to any selective pressure, variants are isolated. These unscreened variants are characterized for covariation in the same way as the parental genes. In most cases, the distribution of the varying residues is uniform, with all varying residues existing in conjunction with all other varying residues. To the extent that there is covariation, that covariation is not the result of functional constraints (i.e., the variants have not been exposed to selection). This, in effect, serves as a control for the question of whether the covariation is a result of functional constraints. After synthetic shuffling and selection for function, covarying residue pairs that are identified are the result of functional constraints. The covariation found among the parental genes and not among the functionally active library variants could also reflect a selective pressure for indirect effects on the organism. Indirect effects could potentially be any trait, such as sequestering of cofactors or cellular localization, that is not specifically related to the screening criteria of the selection assay.

1. Mutual Information Analysis

In a protein alignment, the entropy measure for each position in the alignment indicates the degree of variability and preference for each amino acid. The following equation is used to quantify site-entropy (Shannon, C. E. The mathematical theory of communication. 1963. MD Comput 14, 306-17. (1997)).

$\begin{matrix}{I_{i} = {\sum\limits_{k}\; {{P\left( A_{i}^{k}\; \right)}\log {P\left( A_{i}^{k}\; \right)}}}} & (1)\end{matrix}$

Where the sum is over all k amino acids {A^(k) _(i)} occurring at position i in the alignment. P(A^(k) _(i)) is the probability of amino acid k at position i. Likewise, covariance between amino acids can be measured by using the mutual information content between pairs of sites.
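A minimal sketch of the site-entropy calculation is given below. Several points are assumptions for illustration: probabilities are estimated as observed frequencies in the alignment column, the natural logarithm is used, and the conventional Shannon form with a leading minus sign is applied so that the result is non-negative (it differs from the sum in equation (1) as printed only in sign). The function name is hypothetical.

```python
import math
from collections import Counter


def site_entropy(column):
    """Shannon entropy (natural log) of the residues in one alignment column.

    Probabilities P(A_i^k) are estimated as observed frequencies.
    Assumption: the conventional leading minus sign is used.
    """
    n = len(column)
    if n == 0:
        raise ValueError("column must contain at least one residue")
    counts = Counter(column)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


# A fully conserved column carries no entropy; a column split evenly
# between two residues has entropy log(2).
conserved = site_entropy("AAAA")
split = site_entropy("AALL")
```

Higher values indicate positions that tolerate more residue diversity among the sampled variants.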

$\begin{matrix}{{MI}_{ij} = {\sum\limits_{k}\; {\sum\limits_{l}\; {{P\left( {A_{i}^{k}\mspace{14mu} {and}\mspace{14mu} A_{j}^{l}} \right)}\log \frac{P\left( {A_{i}^{k}\mspace{14mu} {and}\mspace{14mu} A_{j}^{l}} \right)}{{P\left( A_{i}^{k}\; \right)}{P\left( A_{j}^{l}\; \right)}}}}}} & (2)\end{matrix}$

The double summation is over all possible pairs of amino acids {A^(k) _(i)} and {A^(l) _(j)} at positions i and j, respectively. P(A^(k) _(i)) is the probability of amino acid k at position i and P(A^(k) _(i) and A^(l) _(j)) is the joint probability of amino acid k at position i and amino acid l at position j.
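Equation (2) can be sketched in code as follows. This is an illustrative implementation under stated assumptions, not the disclosed software: marginal and joint probabilities are estimated as observed frequencies, the natural logarithm is used, and residue pairs that never co-occur contribute nothing to the sum. The function name and example columns are hypothetical.

```python
import math
from collections import Counter


def mutual_information(col_i, col_j):
    """Mutual information MI_ij between two alignment columns (natural log)."""
    if len(col_i) != len(col_j):
        raise ValueError("columns must come from the same set of sequences")
    n = len(col_i)
    if n == 0:
        raise ValueError("columns must contain at least one residue")
    count_i = Counter(col_i)                # counts for P(A_i^k)
    count_j = Counter(col_j)                # counts for P(A_j^l)
    count_ij = Counter(zip(col_i, col_j))   # counts for P(A_i^k and A_j^l)
    mi = 0.0
    for (a, b), c in count_ij.items():
        joint = c / n
        mi += joint * math.log(joint / ((count_i[a] / n) * (count_j[b] / n)))
    return mi


# Perfectly coupled positions: knowing one residue fixes the other.
coupled = mutual_information("AALL", "GGVV")
# Independent positions: every combination is equally frequent.
independent = mutual_information("AALL", "GVGV")
```

For the coupled columns the result equals log(2), and for the independent columns it is zero, which matches the behavior expected of fully shuffled, unselected positions.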

The MI values are normalized for each group of variants to have the same mean of 0.0 and standard deviation of 1.0. The degree of co-variation among any residue pair is identified by the deviation of the MI for the given pair from the expected mutual information content.
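The normalization step can be sketched as a z-score transformation over all residue pairs in one group of variants, followed by a cutoff. The cutoff of 2 standard deviations follows the definition of covariation given earlier in this section; the use of the population standard deviation, the function names, and the example MI values are assumptions made for illustration.

```python
import itertools
import statistics


def normalize_mi(mi_by_pair):
    """Rescale MI values to mean 0.0 and standard deviation 1.0."""
    values = list(mi_by_pair.values())
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # assumption: population standard deviation
    if sd == 0:
        raise ValueError("MI values are all identical; cannot normalize")
    return {pair: (v - mean) / sd for pair, v in mi_by_pair.items()}


def covarying_pairs(mi_by_pair, threshold=2.0):
    """Return residue pairs whose normalized MI exceeds the threshold."""
    z = normalize_mi(mi_by_pair)
    return sorted(pair for pair, score in z.items() if score > threshold)


# Hypothetical MI values for the 10 pairs among 5 varying positions:
# background noise everywhere except one strongly coupled pair.
mi = {pair: 0.02 for pair in itertools.combinations(range(5), 2)}
mi[(1, 3)] = 0.62
z_scores = normalize_mi(mi)
hits = covarying_pairs(mi)
```

Because each group (parents, unscreened variants, functional variants) is normalized separately, the resulting scores can be compared across groups even when their raw MI values differ in scale.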

While the foregoing invention has been described in some detail for purposes of clarity and understanding, it will be clear to one skilled in the art from a reading of this disclosure that various changes in form and detail can be made without departing from the true scope of the invention. For example, all the techniques and apparatus described above may be used in various combinations. All publications, patents, patent applications, or other documents cited in this application are incorporated by reference in their entirety for all purposes to the same extent as if each individual publication, patent, patent application, or other document were individually indicated to be incorporated by reference for all purposes.

1. A method for identifying nucleotides for variation in nucleic acids encoding a protein variant library, said method comprising: (a) receiving data characterizing a training set of a protein variant library, wherein the data comprises activity and a nucleotide sequence for each protein variant in the training set; (b) from the data, developing a sequence activity model for predicting activity from multiple independent variables, each specifying the presence or absence of a specific nucleotide in a sequence; (c) using the sequence activity model to identify one or more nucleotides that are to be varied or fixed in order to impact the desired activity; and (d) generating a new protein variant library containing one or more new protein variants having amino acid sequences encoded by nucleic acids in which the identified nucleotides are varied or fixed in order as identified in (c).
2. The method of claim 1, wherein the independent variables do not represent physical or chemical properties of amino acids.
3. The method of claim 1, wherein the independent variables represent identities of the specific nucleotides without reference to physical or chemical properties that characterize amino acids.
4. The method of claim 1, wherein the independent variables have associated coefficients specifying a magnitude of contribution of the specific nucleotides at their corresponding positions to said activity.
5. The method of claim 1, wherein the presence or absence of specific nucleotides, as specified by the independent variables, is represented by bit values.
6. The method of claim 1, wherein the using the sequence activity model in (c) comprises identifying the one or more nucleotides that are to be varied or fixed in a reference nucleotide sequence.
7. The method of claim 1, further comprising: (e) assaying the new protein variant library to provide activity information for members of the new protein variant library to select a protein for production; and (f) producing the protein selected in (e).
8. The method of claim 1, further comprising: (e) assaying the new protein variant library to provide an updated training set comprising sequence and activity information for members of the new protein variant library; (f) developing a new sequence activity model from the updated training set; and (g) using the new sequence activity model to identify one or more nucleotides in a new reference nucleotide sequence that are to be varied or fixed in order to impact the desired activity.
9. The method of claim 1, wherein the protein variant library of operation (a) comprises proteins that are encoded by members of a single gene family.
10. The method of claim 1, wherein the protein variant library of step (a) comprises proteins that are obtained by using a recombination-based diversity generation mechanism.
11. The method of claim 1, further comprising developing a new sequence activity model using activity and sequence data characterizing new proteins of the new protein variant library.
12. The method of claim 1, wherein the sequence activity model is a regression model.
13. The method of claim 1, wherein the sequence activity model is a partial least squares model.
14. The method of claim 1, wherein the sequence activity model is a neural network.
15. A method for identifying nucleotides for variation in nucleic acids encoding a protein variant library, said method comprising: (a) receiving data characterizing a training set of a protein variant library, wherein the data comprises activity and a nucleotide sequence for each protein variant in the training set; (b) from the data, developing a sequence activity model for predicting activity from multiple independent variables, each specifying the presence or absence of a specific nucleotide, wherein the sequence activity model comprises indicators representing the impact of corresponding specific nucleotides on activity; (c) using the sequence activity model to identify one or more nucleotides that are to be varied or fixed in order to impact the desired activity; and (d) generating a new protein variant library containing one or more new protein variants having amino acid sequences encoded by nucleic acids in which the identified nucleotides are varied or fixed in order as identified in (c).
16. The method of claim 15, wherein the independent variables represent identities of the specific nucleotides without reference to physical or chemical properties that characterize amino acids.
17. The method of claim 15, wherein the presence or absence of specific nucleotides, as specified by the independent variables, is represented by bit values.
18. The method of claim 15, wherein the using the sequence activity model in (c) comprises identifying the one or more nucleotides that are to be varied or fixed in a reference nucleotide sequence.
19. The method of claim 15, further comprising: (e) assaying the new protein variant library to provide activity information for members of the new protein variant library to select a protein for production; and (f) producing the protein selected in (e).
20. The method of claim 1, wherein the sequence activity model is a regression model.
21. A computer program product comprising a non-transitory machine readable medium storing program code for identifying nucleotides for variation in nucleic acids encoding a protein variant library, said program code comprising: (a) code for receiving data characterizing a training set of a protein variant library, wherein the data comprises activity and a nucleotide sequence for each protein variant in the training set; (b) code for using the data to develop a sequence activity model for predicting activity from multiple independent variables, each specifying the presence or absence of a specific nucleotide in a sequence; (c) code for using the sequence activity model to identify one or more nucleotides that are to be varied or fixed in order to impact the desired activity; and (d) code for defining a new protein variant library containing one or more new protein variants having amino acid sequences encoded by nucleic acids in which the identified nucleotides are varied or fixed in order by executing the code in (c).
22. The computer program product of claim 21, wherein the independent variables do not represent physical or chemical properties of amino acids.
23. The computer program product of claim 21, wherein the independent variables represent identities of the specific nucleotides without reference to physical or chemical properties that characterize amino acids.
24. The computer program product of claim 21, wherein the independent variables of the sequence activity model have associated coefficients specifying a magnitude of contribution of the specific nucleotides at their corresponding positions to said activity.
25. The computer program product of claim 21, wherein the presence or absence of specific nucleotides, as specified by the independent variables, is represented by bit values.
26. The computer program product of claim 21, wherein the code for using the sequence activity model comprises code for identifying the one or more nucleotides that are to be varied or fixed in a reference nucleotide sequence.
27. The computer program product of claim 21, further comprising code for developing a new sequence activity model using activity and sequence data characterizing new proteins of the new protein variant library.
28. The computer program product of claim 27, further comprising code for using the new sequence activity model to identify one or more nucleotides in a new reference nucleotide sequence that are to be varied or fixed in order to impact the desired activity.
29. The computer program product of claim 21, wherein the sequence activity model is a regression model.
30. The computer program product of claim 21, wherein the sequence activity model is a partial least squares model.
31. The computer program product of claim 21, wherein the sequence activity model is a neural network.
32. A computer program product comprising a non-transitory machine readable medium on which is provided program code for identifying nucleotides for variation in nucleic acids encoding a protein variant library, said program code comprising: (a) code for receiving data characterizing a training set of a protein variant library, wherein the data comprises activity and a nucleotide sequence for each protein variant in the training set; (b) code for using the data to develop a sequence activity model for predicting activity from multiple independent variables, each specifying the presence or absence of a specific nucleotide, wherein the sequence activity model comprises indicators representing the impact of corresponding specific nucleotides on activity; (c) code for using the sequence activity model to identify one or more nucleotides that are to be varied or fixed in order to impact the desired activity; and (d) code for defining a new protein variant library containing one or more new protein variants having amino acid sequences encoded by nucleic acids in which the identified nucleotides are varied or fixed in order as identified by executing the code in (c).
33. The computer program product of claim 32, wherein the independent variables represent identities of the specific nucleotides without reference to physical or chemical properties that characterize amino acids.
34. The computer program product of claim 32, wherein the presence or absence of specific nucleotides, as specified by the independent variables, is represented by bit values.
35. The computer program product of claim 32, wherein the using the sequence activity model in (c) comprises identifying the one or more nucleotides that are to be varied or fixed in a reference nucleotide sequence.
36. The computer program product of claim 32, wherein the sequence activity model is a regression model.